Apparatus and method for automated protein design

ABSTRACT

The present invention relates to apparatus and methods for quantitative protein design and optimization.

This application is a continuation application of U.S. Ser. No.09/837,886 filed Apr. 18, 2001, which is a continuation application ofU.S. Ser. No. 09/714,357 filed Nov. 15, 2000, now U.S. Pat. No.6,708,120, which is a divisional of U.S. Ser. No. 09/058,459 filed Apr.10, 1998, now U.S. Pat. No. 6,188,965, which claims the benefit of thefiling date under 35 U.S.C. § 119(e) of U.S. Ser. No. 60/043,464 filedApr. 11, 1997, U.S. Ser. No. 60/054,678 filed on Aug. 4, 1997, and U.S.Ser. No. 60/061,097 filed on October 3.

FIELD OF THE INVENTION

The present invention relates to an apparatus and method forquantitative protein design and optimization.

BACKGROUND OF THE INVENTION

De novo protein design has received considerable attention recently, andsignificant advances have been made toward the goal of producing stable,well-folded proteins with novel sequences. Efforts to design proteinsrely on knowledge of the physical properties that determine proteinstructure, such as the patterns of hydrophobic and hydrophilic residuesin the sequence, salt bridges and hydrogen bonds, and secondarystructural preferences of amino acids. Various approaches to apply theseprinciples have been attempted. For example, the construction ofα-helical and β-sheet proteins with native-like sequences was attemptedby individually selecting the residue required at every position in thetarget fold (Hecht, et al., Science 249:884-891 (1990); Quinn, et al.,Proc. Natl. Acad. Sci USA 91:8747-8751 (1994)). Alternatively, aminimalist approach was used to design helical proteins, where thesimplest possible sequence believed to be consistent with the foldedstructure was generated (Regan, et al., Science 241:976-978 (1988);DeGrado, et al., Science 243:622-628 (1989); Handel, et al., Science261:879-885 (1993)), with varying degrees of success. An experimentalmethod that relies on the hydrophobic and polar (HP) pattern of asequence was developed where a library of sequences with the correctpattern for a four helix bundle was generated by random mutagenesis(Kamtekar, et al., Science 262:1680-1685 (1993)). Among non de novoapproaches, domains of naturally occurring proteins have been modifiedor coupled together to achieve a desired tertiary organization (Pessi,et al., Nature 362:367-369 (1993); Pomerantz, et al., Science 267:93-96(1995)).

Though the correct secondary structure and overall tertiary organizationseem to have been attained by several of the above techniques, manydesigned proteins appear to lack the structural specificity of nativeproteins. The complementary geometric arrangement of amino acids in thefolded protein is the root of this specificity and is encoded in thesequence.

Several groups have applied and experimentally tested systematic,quantitative methods to protein design with the goal of developinggeneral design algorithms (Hellinga, et al., J. Mol. Biol. 222: 763-785(1991); Hurley, et al., J. Mol. Biol. 224:1143-1154 (1992); Desjarlaisl,et al., Protein Science 4:2006-2018 (1995); Harbury, et al., Proc. Natl.Acad. Sci. USA 92:8408-8412 (1995); Klemba, et al., Nat. Struc. Biol.2:368-373 (1995); Nautiyal, et al., Biochemistry 34:11645-11651 (1995);Betzo, et al., Biochemistry 35:6955-6962 (1996); Dahiyat, et al.,Protein Science 5:895-903 (1996); Jones, Protein Science 3:567-574(1994); Konoi, et al., Proteins: Structure, Function and Genetics19:244-255 (1994)). These algorithms consider the spatial positioningand steric complementarity of side chains by explicitly modeling theatoms of sequences under consideration. To date, such techniques havetypically focused on designing the cores of proteins and have scoredsequences with van der Waals and sometimes hydrophobic solvationpotentials.

In addition, the qualitative nature of many design approaches hashampered the development of improved, second generation, proteinsbecause there are no objective methods for learning from past designsuccesses and failures.

Thus, it is an object of the invention to provide computational proteindesign and optimization via an objective, quantitative design techniqueimplemented in connection with a general purpose computer.

SUMMARY OF THE INVENTION

In accordance with the objects outlined above, the present inventionprovides methods executed by a computer under the control of a program,the computer including a memory for storing the program. The methodcomprising the steps of receiving a protein backbone structure withvariable residue positions, establishing a group of potential rotamersfor each of the variable residue positions, wherein at least onevariable residue position has rotamers from at least two different aminoacid side chains, and analyzing the interaction of each of the rotamerswith all or part of the remainder of the protein backbone structure togenerate a set of optimized protein sequences. The methods furthercomprise classifying each variable residue position as either a core,surface or boundary residue. The analyzing step may include a Dead-EndElimination (DEE) computation. Generally, the analyzing step includesthe use of at least one scoring function selected from the groupconsisting of a Van der Waals potential scoring function, a hydrogenbond potential scoring function, an atomic solvation scoring function, asecondary structure propensity scoring function and an electrostaticscoring function. The methods may further comprise generating a rankordered list of additional optimal sequences from the globally optimalprotein sequence. Some or all of the protein sequences from the orderedlist may be tested to produce potential energy test results.

In an additional aspect, the invention provides nucleic acid sequencesencoding a protein sequence generated by the present methods, andexpression vectors and host cells containing the nucleic acids.

In a further aspect, the invention provides a computer readable memoryto direct a computer to function in a specified manner, comprising aside chain module to correlate a group of potential rotamers for residuepositions of a protein backbone model, and a ranking module to analyzethe interaction of each of said rotamers with all or part of theremainder of said protein to generate a set of optimized proteinsequences. The memory may further comprise an assessment module toassess the correspondence between potential energy test results andtheoretical potential energy data.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a general purpose computer configured in accordancewith an embodiment of the invention.

FIG. 2 illustrates processing steps associated with an embodiment of theinvention.

FIG. 3 illustrates processing steps associated with a ranking moduleused in accordance with an embodiment of the invention. After any DEEstep, any one of the previous DEE steps may be repeated. In addition,any one of the DEE steps may be eliminated; for example, originalsingles DEE (step 74) need not be run.

FIG. 4 depicts the protein design automation cycle.

FIG. 5 depicts the helical wheel diagram of a coiled coil. One heptadrepeat is shown viewed down the major axes of the helices. The a and dpositions define the solvent-inaccessible core of the molecule (Cohen &Parry, 1990, Proteins, Structure, Function and Genetics 7:1-15).

FIGS. 6A and 6B depict the comparison of simulation cost functions toexperimental Tm's. FIG. 6A depicts the initial cost function, whichcontains only a van der Waals term for the eight PDA peptides. FIG. 6Bdepicts the improved cost function containing polar and nonpolar surfacearea terms weighted by atomic solvation parameters derived from QSARanalysis; 16 cal/mol/Å² favors hydrophobic surface burial.

FIG. 7 shows the rank correlation of energy predicted by the simulationmodule versus the combined activity score of λ repressor mutants (Lim,et al., J. Mol. Biol. 219:359-376 (1991); Hellinga, et al., Proc. Natl.Acad. Sci. USA 91:5803-5807 (1994)).

FIG. 8 shows the sequence of pda8d (SEQ ID NO:2) aligned with the secondzinc finger of Zif268 (SEQ ID NO:1). The boxed ositions were designedusing the sequence selection algorithm. The coordinates of PDB record1zaa (Paveletch, et al., Science 252:809-817 (1991)) from residues 33-60were used as the structure template. In our numbering, position 1corresponds to 1zaa position 33.

FIGS. 9A and 9B shows the NMR spectra and solution secondary structureof pda8d from Example 3. FIG. 9A is the TOCSY Hα-HN fingerprint regionof pda8d. FIG. 9B is the NMR NOE connectivities of pda8d. Bars representunambiguous connectivities and the bar thickness of the sequentialconnections is indexed to the intensity of the resonance.

FIGS. 10A and 10B depict the secondary structure content and thermalstability of α90, α85, α70 and α107. FIG. 10A depicts the far UV spectra(circular dichroism). FIG. 10B depicts the thermal denaturationmonitored by CD.

FIG. 11 depicts the sequence of FSD-1 (SEQ ID NO:3) of Example 5 alignedwith the second zinc finger of Zif268 (SEQ ID NO:1). The bar at the topof the figure shows the residue position classifications: solid barsindicate core positions, hatched bars indicate boundary positions andopen bars indicate surface positions. The alignment matches positions ofFSD-1 (SEQ ID NO:3) to the corresponding backbone template positions ofZif268 (SEQ ID NO:1). Of the six identical positions (21%) between FSD-1(SEQ ID NO:3) and Zif268 (SEQ ID NO:1), four are buried (Ile7, Phe12,Leu18 and Ile22). The zinc binding residues of Zif268 (SEQ ID NO:1) areboxed. Representative non-optimal sequence solutions determined using aMonte Carlo simulated annealing protocol are shown with their rank (SEQID NO:4 to SEQ ID NO:22). Vertical lines indicate identity with FSD-1(SEQ ID NO:3). The symbols at the bottom the figure show the degee ofsequence conservation for each residue position computed across the top1000 sequences: filled circles indicate greater than 99% conservation,half-filled circles indicate conservation between 90 and 99%, opencircles indicate conservation between 50 and 90%, and the absence ofsymbol indicates less than 50% conservation. The consensus sequencedetermined by choosing the amino acid with the highest occurrence ateach position is identical to the sequence of FSD-1 (SEQ ID NO:3).

FIG. 12 is a schematic representation of the minimum and maximumquantities (defined in Eq. 24 to 27) that are used to construct speedenhancements. The minima and maxima are utilized directly to find the/i_(u)i_(v)/_(mb) pair and for the comparison of extrema. Thedifferences between the quantities, denoted with arrows, are used toconstruct the q_(rs) and q_(uv) metrics.

FIGS. 13A, 13B, 13C, 13D, 13E and 13F depicts the areas involved incalculating the buried and exposed areas of Equations 18 and 19. Thedashed box is the protein template, the heavy solid lines correspond tothree rotamers at three different residue positions, and the lightersolid lines correspond to surface areas. a) A_(i_(r)t3)⁰for each rotamer. b) A_(i) _(r) _(t) for each rotamer. c)(A_(i_(r)t3)⁰ − A_(i_(r)t))summed over the three residues. The upper residue does not bury any areaagainst the template except that buried in the tri-peptide stateA_(i_(r)t3)⁰.d) A_(i) _(r) _(j) _(s) _(t) for one pair of rotamers. e) The areaburied between rotamers, (A_(i) _(r) _(t)+A_(j) _(s) _(t)−A_(i) _(r)_(j) _(s) _(t)), for the same pair of rotamers as in (d). f) The areaburied between rotamers, (A_(i) _(r) _(t)+A_(j) _(s) _(t)−A_(i) _(r)_(j) _(s) _(t)), summed over the three pairs of rotamers. The area bintersected by all three rotamers is counted twice and is indicated bythe double lines. The buried area calculated by Equation 18 is the areaburied by the template, represented in (c), plus s times the area buriedbetween rotamers, represented in (f). The scaling factor s accounts forthe over-counting shown by the double lines in (f). The exposed areacalculated by Equation 19 is the exposed are in the presence of thetemplate, represented in (b), minus s times the area buried betweenrotamers, represented in (f).

DETAILED DESCRIPTION OF THE INVENTION

The present invention is directed to the quantitative design andoptimization of amino acid sequences, using an “inverse protein folding”approach, which seeks the optimal sequence for a desired structure.Inverse folding is similar to protein design, which seeks to find asequence or set of sequences that will fold into a desired structure.These approaches can be contrasted with a “protein folding” approachwhich attempts to predict a structure taken by a given sequence.

The general preferred approach of the present invention is as follows,although alternate embodiments are discussed below. A known proteinstructure is used as the starting point. The residues to be optimizedare then identified, which may be the entire sequence or subset(s)thereof. The side chains of any positions to be varied are then removed.The resulting structure consisting of the protein backbone and theremaining sidechains is called the template. Each variable residueposition is then preferably classified as a core residue, a surfaceresidue, or a boundary residue; each classification defines a subset ofpossible amino acid residues for the position (for example, coreresidues generally will be selected from the set of hydrophobicresidues, surface residues generally will be selected from thehydrophilic residues, and boundary residues may be either). Each aminoacid can be represented by a discrete set of all allowed conformers ofeach side chain, called rotamers. Thus, to arrive at an optimal sequencefor a backbone, all possible sequences of rotamers must be screened,where each backbone position can be occupied either by each amino acidin all its possible rotameric states, or a subset of amino acids, andthus a subset of rotamers.

Two sets of interactions are then calculated for each rotamer at everyposition: the interaction of the rotamer side chain with all or part ofthe backbone (the “singles” energy, also called the rotamer/template orrotamer/backbone energy), and the interaction of the rotamer side chainwith all other possible rotamers at every other position or a subset ofthe other positions (the “doubles” energy, also called therotamer/rotamer energy). The energy of each of these interactions iscalculated through the use of a variety of scoring functions, whichinclude the energy of van der Waal's forces, the energy of hydrogenbonding, the energy of secondary structure propensity, the energy ofsurface area solvation and the electrostatics. Thus, the total energy ofeach rotamer interaction, both with the backbone and other rotamers, iscalculated, and stored in a matrix form. The discrete nature of rotamersets allows a simple calculation of the number of rotamer sequences tobe tested. A backbone of length n with m possible rotamers per positionwill have m^(n) possible rotamer sequences, a number which growsexponentially with sequence length and renders the calculations eitherunwieldy or impossible in real time. Accordingly, to solve thiscombinatorial search problem, a “Dead End Elimination” (DEE) calculationis performed. The DEE calculation is based on the fact that if the worsttotal interaction of a first rotamer is still better than the best totalinteraction of a second rotamer, then the second rotamer cannot be partof the global optimum solution. Since the energies of all rotamers havealready been calculated, the DEE approach only requires sums over thesequence length to test and eliminate rotamers, which speeds up thecalculations considerably. DEE can be rerun comparing pairs of rotamers,or combinations of rotamers, which will eventually result in thedetermination of a single sequence which represents the global optimumenergy.

Once the global solution has been found, a Monte Carlo search may bedone to generate a rank-ordered list of sequences in the neighborhood ofthe DEE solution. Starting at the DEE solution, random positions arechanged to other rotamers, and the new sequence energy is calculated. Ifthe new sequence meets the criteria for acceptance, it is used as astarting point for another jump. After a predetermined number of jumps,a rank-ordered list of sequences is generated.

The results may then be experimentally verified by physically generatingone or more of the protein sequences followed by experimental testing.The information obtained from the testing can then be fed back into theanalysis, to modify the procedure if necessary.

Thus, the present invention provides a computer-assisted method ofdesigning a protein. The method comprises providing a protein backbonestructure with variable residue positions, and then establishing a groupof potential rotamers for each of the residue positions. As used herein,the backbone, or template, includes the backbone atoms and any fixedside chains. The interactions between the protein backbone and thepotential rotamers, and between pairs of the potential rotamers, arethen processed to generate a set of optimized protein sequences,preferably a single global optimum, which then may be used to generateother related sequences.

FIG. 1 illustrates an automated protein design apparatus 20 inaccordance with an embodiment of the invention. The apparatus 20includes a central processing unit 22 which communicates with a memory24 and a set of input/output devices (e.g., keyboard, mouse, monitor,printer, etc.) 26 through a bus 28. The general interaction between acentral processing unit 22, a memory 24, input/output devices 26, and abus 28 is known in the art. The present invention is directed toward theautomated protein design program 30 stored in the memory 24.

The automated protein design program 30 may be implemented with a sidechain module 32. As discussed in detail below, the side chain moduleestablishes a group of potential rotamers for a selected proteinbackbone structure. The protein design program 30 may also beimplemented with a ranking module 34. As discussed in detail below, theranking module 34 analyzes the interaction of rotamers with the proteinbackbone structure to generate optimized protein sequences. The proteindesign program 30 may also include a search module 36 to execute asearch, for example a Monte Carlo search as described below, in relationto the optimized protein sequences. Finally, an assessment module 38 mayalso be used to assess physical parameters associated with the derivedproteins, as discussed further below.

The memory 24 also stores a protein backbone structure 40, which isdownloaded by a user through the input/output devices 26. The memory 24also stores information on potential rotamers derived by the side chainmodule 32. In addition, the memory 24 stores protein sequences 44generated by the ranking module 34. The protein sequences 44 may bepassed as output to the input/output devices 26.

The operation of the automated protein design apparatus 20 is more fullyappreciated with reference to FIG. 2. FIG. 2 illustrates processingsteps executed in accordance with the method of the invention. Asdescribed below, many of the processing steps are executed by theprotein design program 30. The first processing step illustrated in FIG.2 is to provide a protein backbone structure (step 50). As previouslyindicated, the protein backbone structure is downloaded through theinput/output devices 26 using standard techniques.

The protein backbone structure corresponds to a selected protein. By“protein” herein is meant at least two amino acids linked together by apeptide bond. As used herein, protein includes proteins, oligopeptidesand peptides. The peptidyl group may comprise naturally occurring aminoacids and peptide bonds, or synthetic peptidomimetic structures, i.e.“analogs”, such as peptoids (see Simon et al., PNAS USA 89(20):9367(1992)). The amino acids may either be naturally occuring ornon-naturally occuring; as will be appreciated by those in the art, anystructure for which a set of rotamers is known or can be generated canbe used as an amino acid. The side chains may be in either the (R) orthe (S) configuration. In a preferred embodiment, the amino acids are inthe (S) or L-configuration.

The chosen protein may be any protein for which a three dimensionalstructure is known or can be generated; that is, for which there arethree dimensional coordinates for each atom of the protein. Generallythis can be determined using X-ray crystallographic techniques, NMRtechniques, de novo modelling, homology modelling, etc. In general, ifX-ray structures are used, structures at 2A resolution or better arepreferred, but not required.

The proteins may be from any organism, including prokaryotes andeukaryotes, with enzymes from bacteria, fungi, extremeophiles such asthe archebacteria, insects, fish, animals (particularly mammals andparticularly human) and birds all possible.

Suitable proteins include, but are not limited to, industrial andpharmaceutical proteins, including ligands, cell surface receptors,antigens, antibodies, cytokines, hormones, and enzymes. Suitable classesof enzymes include, but are not limited to, hydrolases such asproteases, carbohydrases, lipases; isomerases such as racemases,epimerases, tautomerases, or mutases; transferases, kinases,oxidoreductases, and phophatases. Suitable enzymes are listed in theSwiss-Prot enzyme database.

Suitable protein backbones include, but are not limited to, all of thosefound in the protein data base compiled and serviced by the BrookhavenNational Lab.

Specifically included within “protein” are fragments and domains ofknown proteins, including functional domains such as enzymatic domains,binding domains, etc., and smaller fragments, such as turns, loops, etc.That is, portions of proteins may be used as well.

Once the protein is chosen, the protein backbone structure is input intothe computer. By “protein backbone structure” or grammatical equivalentsherein is meant the three dimensional coordinates that define the threedimensional structure of a particular protein. The structures whichcomprise a protein backbone structure (of a naturally occuring protein)are the nitrogen, the carbonyl carbon, the α-carbon, and the carbonyloxygen, along with the direction of the vector from the α-carbon to theβ-carbon.

The protein backbone structure which is input into the computer caneither include the coordinates for both the backbone and the amino acidside chains, or just the backbone, i.e. with the coordinates for theamino acid side chains removed. If the former is done, the side chainatoms of each amino acid of the protein structure may be “stripped” orremoved from the structure of a protein, as is known in the art, leavingonly the coordinates for the “backbone” atoms (the nitrogen, carbonylcarbon and oxygen, and the α-carbon, and the hydrogens attached to thenitrogen and α-carbon).

After inputing the protein structure backbone, explicit hydrogens areadded if not included within the structure (for example, if thestructure was generated by X-ray crystallography, hydrogens must beadded). After hydrogen addition, energy minimization of the structure isrun, to relax the hydrogens as well as the other atoms, bond angles andbond lengths. In a preferred embodiment, this is done by doing a numberof steps of conjugate gradient minimization (Mayo et al., J. Phys. Chem.94:8897 (1990)) of atomic coordinate positions to minimize the Dreidingforce field with no electrostatics. Generally from about 10 to about 250steps is preferred, with about 50 being most preferred.

The protein backbone structure contains at least one variable residueposition. As is known in the art, the residues, or amino acids, ofproteins are generally sequentially numbered starting with theN-terminus of the protein. Thus a protein having a methionine at it'sN-terminus is said to have a methionine at residue or amino acidposition 1, with the next residues as 2, 3, 4, etc. At each position,the wild type (i.e. naturally occuring) protein may have one of at least20 amino acids, in any number of rotamers. By “variable residueposition” herein is meant an amino acid position of the protein to bedesigned that is not fixed in the design method as a specific residue orrotamer, generally the wild-type residue or rotamer.

In a preferred embodiment, all of the residue positions of the proteinare variable. That is, every amino acid side chain may be altered in themethods of the present invention. This is particularly desirable forsmaller proteins, although the present methods allow the design oflarger proteins as well. While there is no theoretical limit to thelength of the protein which may be designed this way, there is apractical computational limit.

In an alternate preferred embodiment, only some of the residue positionsof the protein are variable, and the remainder are “fixed”, that is,they are identified in the three dimensional structure as being in a setconformation. In some embodiments, a fixed position is left in itsoriginal conformation (which may or may not correlate to a specificrotamer of the rotamer library being used). Alternatively, residues maybe fixed as a non-wild type residue; for example, when knownsite-directed mutagenesis techniques have shown that a particularresidue is desirable (for example, to eliminate a proteolytic site oralter the substrate specificity of an enzyme), the residue may be fixedas a particular amino acid. Alternatively, the methods of the presentinvention may be used to evaluate mutations de novo, as is discussedbelow. In an alternate preferred embodiment, a fixed position may be“floated”; the amino acid at that position is fixed, but differentrotamers of that amino acid are tested. In this embodiment, the variableresidues may be at least one, or anywhere from 0.1% to 99.9% of thetotal number of residues. Thus, for example, it may be possible tochange only a few (or one) residues, or most of the residues, with allpossibilities in between.

In a preferred embodiment, residues which can be fixed include, but arenot limited to, structurally or biologically functional residues. Forexample, residues which are known to be important for biologicalactivity, such as the residues which form the active site of an enzyme,the substrate binding site of an enzyme, the binding site for a bindingpartner (ligand/receptor, antigen/antibody, etc.), phosphorylation orglycosylation sites which are crucial to biological function, orstructurally important residues, such as disulfide bridges, metalbinding sites, critical hydrogen bonding residues, residues critical forbackbone conformation such as proline or glycine, residues critical forpacking interactions, etc. may all be fixed in a conformation or as asingle rotamer, or “floated”.

Similarly, residues which may be chosen as variable residues may bethose that confer undesirable biological attributes, such assusceptibility to proteolytic degradation, dimerization or aggregationsites, glycosylation sites which may lead to immune responses, unwantedbinding activity, unwanted allostery, undesirable enzyme activity butwith a preservation of binding, etc.

As will be appreciated by those in the art, the methods of the presentinvention allow computational testing of “site-directed mutagenesis”targets without actually making the mutants, or prior to making themutants. That is, quick analysis of sequences in which a small number ofresidues are changed can be done to evaluate whether a proposed changeis desirable. In addition, this may be done on a known protein, or on anprotein optimized as described herein.

As will be appreciated by those in the art, a domain of a larger proteinmay essentially be treated as a small independent protein; that is, astructural or functional domain of a large protein may have minimalinteractions with the remainder of the protein and may essentially betreated as if it were autonomous. In this embodiment, all or part of theresidues of the domain may be variable.

It should be noted that even if a position is chosen as a variableposition, it is possible that the methods of the invention will optimizethe sequence in such a way as to select the wild type residue at thevariable position. This generally occurs more frequently for coreresidues, and less regularly for surface residues. In addition, it ispossible to fix residues as non-wild type amino acids as well.

Once the protein backbone structure has been selected and input, and thevariable residue positions chosen, a group of potential rotamers foreach of the variable residue positions is established. This operation isshown as step 52 in FIG. 2. This step may be implemented using the sidechain module 32. In one embodiment of the invention, the side chainmodule 32 includes at least one rotamer library, as described below, andprogram code that correlates the selected protein backbone structurewith corresponding information in the rotamer library. Alternatively,the side chain module 32 may be omitted and the potential rotamers 42for the selected protein backbone structure may be downloaded throughthe input/output devices 26.

As is known in the art, each amino acid side chain has a set of possibleconformers, called rotamers. See Ponder, et al., Acad. Press Inc.(London) Ltd. pp. 775-791 (1987); Dunbrack, et al., Struc. Biol.1(5):334-340 (1994); Desmet, et al., Nature 356:539-542 (1992), all ofwhich are hereby expressly incorporated by reference in their entireity.Thus, a set of discrete rotamers for every amino acid side chain isused. There are two general types of rotamer libraries: backbonedependent and backbone independent. A backbone dependent rotamer libraryallows different rotamers depending on the position of the residue inthe backbone; thus for example, certain leucine rotamers are allowed ifthe position is within an a helix, and different leucine rotamers areallowed if the position is not in a α-helix. A backbone independentrotamer library utilizes all rotamers of an amino acid at everyposition. In general, a backbone independent library is preferred in theconsideration of core residues, since flexibility in the core isimportant. However, backbone independent libraries are computationallymore expensive, and thus for surface and boundary positions, a backbonedependent library is preferred. However, either type of library can beused at any position.

In addition, a preferred embodiment does a type of “fine tuning” of therotamer library by expanding the possible χ (chi) angle values of therotamers by plus and minus one standard deviation (or more) about themean value, in order to minimize possible errors that might arise fromthe discreteness of the library. This is particularly important foraromatic residues, and fairly important for hydrophobic residues, due tothe increased requirements for flexibility in the core and the rigidityof aromatic rings; it is not as important for the other residues. Thus apreferred embodiment expands the χ₁ and χ₂ angles for all amino acidsexcept Met, Arg and Lys.

To roughly illustrate the numbers of rotamers, in one version of theDunbrack & Karplus backbone-dependent rotamer library, alanine has 1rotamer, glycine has 1 rotamer, arginine has 55 rotamers, threonine has9 rotamers, lysine has 57 rotamers, glutamic acid has 69 rotamers,asparagine has 54 rotamers, aspartic acid has 27 rotamers, tryptophanhas 54 rotamers, tyrosine has 36 rotamers, cysteine has 9 rotamers,glutamine has 69 rotamers, histidine has 54 rotamers, valine has 9rotamers, isoleucine has 45 rotamers, leucine has 36 rotamers,methionine has 21 rotamers, serine has 9 rotamers, and phenylalanine has36 rotamers.

In general, proline is not generally used, since it will rarely bechosen for any position, although it can be included if desired.Similarly, a preferred embodiment omits cysteine as a consideration,only to avoid potential disulfide problems, although it can be includedif desired.

As will be appreciated by those in the art, other rotamer libraries withall dihedral angles staggered can be used or generated.

In a preferred embodiment, at a minimum, at least one variable positionhas rotamers from at least two different amino acid side chains; thatis, a sequence is being optimized, rather than a structure.

In a preferred embodiment, rotamers from all of the amino acids (or allof them except cysteine, glycine and proline) are used for each variableresidue position; that is, the group or set of potential rotamers ateach variable position is every possible rotamer of each amino acid.This is especially preferred when the number of variable positions isnot high as this type of analysis can be computationally expensive.

In a preferred embodiment, each variable position is classified aseither a core, surface or boundary residue position, although in somecases, as explained below, the variable position may be set to glycineto minimize backbone strain.

It should be understood that quantitative protein design or optimizationstudies prior to the present invention focused almost exclusively oncore residues. The present invention, however, provides methods fordesigning proteins containing core, surface and boundary positions.Alternate embodiments utilize methods for designing proteins containingcore and surface residues, core and boundary residues, and surface andboundary residues, as well as core residues alone (using the scoringfunctions of the present invention), surface residues alone, or boundaryresidues alone.

The classification of residue positions as core, surface or boundary maybe done in several ways, as will be appreciated by those in the art. Ina preferred embodiment, the classification is done via a visual scan ofthe original protein backbone structure, including the side chains, andassigning a classification based on a subjective evaluation of oneskilled in the art of protein modelling. Alternatively, a preferredembodiment utilizes an assessment of the orientation of the Cα-Cβvectors relative to a solvent accessible surface computed using only thetemplate Cα atoms. In a preferred embodiment, the solvent accessiblesurface for only the Cα atoms of the target fold is generated using theConnolly algorithm with a probe radius ranging from about 4 to about 12Å, with from about 6 to about 10 Å being preferred, and 8 Å beingparticularly preferred. The Cα radius used ranges from about 1.6 Å toabout 2.3 Å, with from about 1.8 to about 2.1 Å being preferred, and1.95 Å being especially preferred. A residue is classified as a coreposition if a) the distance for its Cα, along its Cα-Cβ vector, to thesolvent accessible surface is greater than about 4-6 Å, with greaterthan about 5.0 Å being especially preferred, and b) the distance for itsCβ to the nearest surface point is greater than about 1.5-3 Å, withgreater than about 2.0 Å being especially preferred. The remainingresidues are classified as surface positions if the sum of the distancesfrom their Cα, along their Cα-Cβ vector, to the solvent accessiblesurface, plus the distance from their Cβ to the closest surface pointwas less than about 2.5-4 Å, with less than about 2.7 Å being especiallypreferred. All remaining residues are classified as boundary positions.

Once each variable position is classified as either core, surface orboundary, a set of amino acid side chains, and thus a set of rotamers,is assigned to each position. That is, the set of possible amino acidside chains that the program will allow to be considered at anyparticular position is chosen. Subsequently, once the possible aminoacid side chains are chosen, the set of rotamers that will be evaluatedat a particular position can be determined. Thus, a core residue willgenerally be selected from the group of hydrophobic residues consistingof alanine, valine, isoleucine, leucine, phenylalanine, tyrosine,tryptophan, and methionine (in some embodiments, when the α scalingfactor of the van der Waals scoring function, described below, is low,methionine is removed from the set), and the rotamer set for each coreposition potentially includes rotamers for these eight amino acid sidechains (all the rotamers if a backbone independent library is used, andsubsets if a rotamer dependent backbone is used). Similarly, surfacepositions are generally selected from the group of hydrophilic residuesconsisting of alanine, serine, threonine, aspartic acid, asparagine,glutamine, glutamic acid, arginine, lysine and histidine. The rotamerset for each surface position thus includes rotamers for these tenresidues. Finally, boundary positions are generally chosen from alanine,serine, threonine, aspartic acid, asparagine, glutamine, glutamic acid,arginine, lysine histidine, valine, isoleucine, leucine, phenylalanine,tyrosine, tryptophan, and methionine. The rotamer set for each boundaryposition thus potentially includes every rotamer for these seventeenresidues (assuming cysteine, glycine and proline are not used, althoughthey can be).

Thus, as will be appreciated by those in the art, there is acomputational benefit to classifying the residue positions, as itdecreases the number of calculations. It should also be noted that theremay be situations where the sets of core, boundary and surface residuesare altered from those described above; for example, under somecircumstances, one or more amino acids is either added or subtractedfrom the set of allowed amino acids. For example, some proteins whichdimerize or multimerize, or have ligand binding sites, may containhydrophobic surface residues, etc. In addition, residues that do notallow helix “capping” or the favorable interaction with an α-helixdipole may be subtracted from a set of allowed residues. Thismodification of amino acid groups is done on a residue by residue basis.

In a preferred embodiment, proline, cysteine and glycine are notincluded in the list of possible amino acid side chains, and thus therotamers for these side chains are not used. However, in a preferredembodiment, when the variable residue position has a φ angle (that is,the dihedral angle defined by 1) the carbonyl carbon of the precedingamino acid; 2) the nitrogen atom of the current residue; 3) the α-carbonof the current residue; and 4) the carbonyl carbon of the currentresidue) greater than 0°, the position is set to glycine to minimizebackbone strain.

Once the group of potential rotamers is assigned for each variableresidue position, processing proceeds to step 54 of FIG. 2. Thisprocessing step entails analyzing interactions of the rotamers with eachother and with the protein backbone to generate optimized proteinsequences. The ranking module 34 may be used to perform theseoperations. That is, computer code is written to implement the followingfunctions. Simplistically, as is generally outlined above, theprocessing initially comprises the use of a number of scoring functions,described below, to calculate energies of interactions of the rotamers,either to the backbone itself or other rotamers.

The scoring functions include a Van der Waals potential scoringfunction, a hydrogen bond potential scoring function, an atomicsolvation scoring function, a secondary structure propensity scoringfunction and an electrostatic scoring function. As is further describedbelow, at least one scoring function is used to score each position,although the scoring functions may differ depending on the positionclassification or other considerations, like favorable interaction withan α-helix dipole. As outlined below, the total energy which is used inthe calculations is the sum of the energy of each scoring function usedat a particular position, as is generally shown in Equation 1:E _(total) =nE _(vdw) +nE _(as) +nE _(h-bonding) +nE _(ss) +nE_(elec)  Equation 1

In Equation 1, the total energy is the sum of the energy of the van derWaals potential (E_(vdw)), the energy of atomic solvation (E_(as)), theenergy of hydrogen bonding (E_(h-bonding)), the energy of secondarystructure (E_(ss)) and the energy of electrostatic interaction(E_(elec)). The term n is either 0 or 1, depending on whether the termis to be considered for the particular residue position, as is morefully outlined below.

In a preferred embodiment, a van der Waals' scoring function is used. Asis known in the art, van der Waals' forces are the weak, non-covalentand non-ionic forces between atoms and molecules, that is, the induceddipole and electron repulsion (Pauli principle) forces.

The van der Waals scoring function is based on a van der Waals potentialenergy. There are a number of van der Waals potential energycalculations, including a Lennard-Jones 12/6 potential with radii andwell depth parameters from the Dreiding force field, Mayo et al., J.Prot. Chem., 1990, expressly incorporated herein by reference, or theexponential 6 potential. Equation 2, shown below, is the preferredLennard-Jones potential: $\begin{matrix}{E_{vdw} = {D_{0}\left\{ {\left( \frac{R_{0}}{R} \right)^{12} - {2\left( \frac{R_{0}}{R} \right)^{6}}} \right\}}} & {{Equation}\quad 2}\end{matrix}$

R₀ is the geometric mean of the van der Waals radii of the two atomsunder consideration, and D₀ is the geometric mean of the well depth ofthe two atoms under consideration. E_(vdw) and R are the energy andinteratomic distance between the two atoms under consideration, as ismore fully described below.

In a preferred embodiment, the van der Waals forces are scaled using ascaling factor, α, as is generally discussed in Example 4. Equation 3shows the use of α in the van der Waals Lennard-Jones potentialequation: $\begin{matrix}{E_{vdw} = {D_{0}\left\{ {\left( \frac{\alpha\quad R_{0}}{R} \right)^{12} - {2\left( \frac{\alpha\quad R_{0}}{R} \right)^{6}}} \right\}}} & {{Equation}\quad 3}\end{matrix}$

The role of the α scaling factor is to change the importance of packingeffects in the optimization and design of any particular protein. Asdiscussed in the Examples, different values for α result in differentsequences being generated by the present methods. Specifically, areduced van der Waals steric constraint can compensate for therestrictive effect of a fixed backbone and discrete side-chain rotamersin the simulation and can allow a broader sampling of sequencescompatible with a desired fold. In a preferred embodiment, α valuesranging from about 0.70 to about 1.10 can be used, with a values fromabout 0.8 to about 1.05 being preferred, and from about 0.85 to about1.0 being especially preferred. Specific α values which are preferredare 0.80, 0.85, 0.90, 0.95, 1.00, and 1.05.

Generally speaking, variation of the van der Waals scale factor αresults in four regimes of packing specificity: regime 1 where0.9≦α≦1.05 and packing constraints dominate the sequence selection;regime 2 where 0.8≦α≦0.9 and the hydrophobic solvation potential beginsto compete with packing forces; regime 3 where α<0.8 and hydrophobicsolvation dominates the design; and, regime 4 where α>1.05 and van derWaals repulsions appear to be too severe to allow meaningful sequenceselection. In particular, different α values may be used for core,surface and boundary positions, with regimes 1 and 2 being preferred forcore residues, regime 1 being preferred for surface residues, and regime1 and 2 being preferred for boundary residues.

In a preferred embodiment, the van der Waals scaling factor is used inthe total energy calculations for each variable residue position,including core, surface and boundary positions.

In a preferred embodiment, an atomic solvation potential scoringfunction is used. As is appreciated by those in the art, solventinteractions of a protein are a significant factor in protein stability,and residue/protein hydrophobicity has been shown to be the majordriving force in protein folding. Thus, there is an entropic cost tosolvating hydrophobic surfaces, in addition to the potential formisfolding or aggregation. Accordingly, the burial of hydrophobicsurfaces within a protein structure is beneficial to both folding andstability. Similarly, there can be a disadvantage for buryinghydrophilic residues. The accessible surface area of a protein atom isgenerally defined as the area of the surface over which a water moleculecan be placed while making van der Waals contact with this atom and notpenetrating any other protein atom. Thus, in a preferred embodiment, thesolvation potential is generally scored by taking the total possibleexposed surface area of the moiety or two independent moieties (either arotamer or the first rotamer and the second rotamer), which is thereference, and subtracting out the “buried” area, i.e. the area which isnot solvent exposed due to interactions either with the backbone or withother rotamers. This thus gives the exposed surface area.

Alternatively, a preferred embodiment calculates the scoring function onthe basis of the “buried” portion; i.e. the total possible exposedsurface area is calculated, and then the calculated surface area afterthe interaction of the moieties is subtracted, leaving the buriedsurface area. A particularly preferred method does both of thesecalculations.

As is more fully described below, both of these methods can be done in avariety of ways. See Eisenberg et al., Nature 319:199-203 (1986);Connolly, Science 221:709-713 (1983); and Wodak, et al., Proc. Natl.Acad. Sci. USA 77(4):1736-1740 (1980), all of which are expresslyincorporated herein by reference. As will be appreciated by those in theart, this solvation potential scoring function is conformationdependent, rather than conformation independent.

In a preferred embodiment, the pairwise solvation potential isimplemented in two components, “singles” (rotamer/template) and“doubles” (rotamer/rotamer), as is more fully described below. For therotamer/template buried area, the reference state is defined as therotamer in question at residue position i with the backbone atoms onlyof residues i−1, i and i+1, although in some instances just i may beused. Thus, in a preferred embodiment, the solvation potential is notcalculated for the interaction of each backbone atom with a particularrotamer, although more may be done as required. The area of the sidechain is calculated with the backbone atoms excluding solvent but notcounted in the area. The folded state is defined as the area of therotamer in question at residue i, but now in the context of the entiretemplate structure including non-optimized side chains, i.e. every otherfoxed position residue. The rotamer/template buried area is thedifference between the reference and the folded states. Therotamer/rotamer reference area can be done in two ways; one by usingsimply the sum of the areas of the isolated rotamers; the secondincludes the full backbone. The folded state is the area of the tworotamers placed in their relative positions on the protein scaffold butwith no template atoms present. In a preferred embodiment, the Richardsdefinition of solvent accessible surface area (Lee and Richards, J. Mol.Biol. 55:379-400, 1971, hereby incorporated by reference) is used, witha probe radius ranging from 0.8 to 1.6 Å, with 1.4 Å being preferred,and Drieding van der Waals radii, scaled from 0.8 to 1.0. Carbon andsulfur, and all attached hydrogens, are considered nonpolar. Nitrogenand oxygen, and all attached hydrogens, are considered polar. Surfaceareas are calculated with the Connolly algorithm using a dot density of10 Å−2 (Connolly, (1983) (supra), hereby incorporated by reference).

In a preferred embodiment, there is a correction for a possibleoverestimation of buried surface area which may exist in the calculationof the energy of interaction between two rotamers (but not theinteraction of a rotamer with the backbone). Since, as is generallyoutlined below, rotamers are only considered in pairs, that is, a firstrotamer is only compared to a second rotamer during the “doubles”calculations, this may overestimate the amount of buried surface area inlocations where more than two rotamers interact, that is, where rotamersfrom three or more residue positions come together. Thus, a correctionor scaling factor is used as outlined below.

The general energy of solvation is shown in Equation 4:E _(sa) =f(SA)  Equation 4where E_(sa) is the energy of solvation, f is a constant used tocorrelate surface area and energy, and SA is the surface area. Thisequation can be broken down, depending on which parameter is beingevaluated. Thus, when the hydrophobic buried surface area is used,Equation 5 is appropriate:E _(sa) =f ₁(SA _(buried hydrophobic))  Equation 5where f₁ is a constant which ranges from about 10 to about 50cal/mol/Å², with 23 or 26 cal/mol/Å² being preferred. When a penalty forhydrophilic burial is being considered, the equation is shown inEquation 6:E _(sa) =f ₁(SA _(buried hydrophobic))+f ₂(SA_(buried hydrophilic))  Equation 6where f₂ is a constant which ranges from −50 to −250 cal/mol/Å², with−86 or −100 cal/mol/Å² being preferred. Similarly, if a penalty forhydrophobic exposure is used, equation 7 or 8 may be used:E _(sa) =f ₁(SA _(buried hydrophobic))+f ₃(SA_(exposed hydrophobic))  Equation 7E _(sa) =f ₁(SA _(buried hydrophobic))+f ₂(SA _(buried hydrophilic))+f₃(SA _(exposed hydrophobic))+f ₄(SA _(exposed hydrophilic))  Equation 8

In a preferred embodiment, f₃=−f₁.

In one embodiment, backbone atoms are not included in the calculation ofsurface areas, and values of 23 cal/mol/Å² (f₁) and −86 cal/mol/Å² (f₂)are determined.

In a preferred embodiment, this overcounting problem is addressed usinga scaling factor that compensates for only the portion of the expressionfor pairwise area that is subject to over-counting. In this embodiment,values of −26 cal/mol/Å² (f₁) and 100 cal/mol/Å² (f₂) are determined.

Atomic solvation energy is expensive, in terms of computational time andresources. Accordingly, in a preferred embodiment, the solvation energyis calculated for core and/or boundary residues, but not surfaceresidues, with both a calculation for core and boundary residues beingpreferred, although any combination of the three is possible.

In a preferred embodiment, a hydrogen bond potential scoring function isused. A hydrogen bond potential is used as predicted hydrogen bonds docontribute to designed protein stability (see Stickle et al., J. Mol.Biol. 226:1143 (1992); Huyghues-Despointes et al., Biochem. 34:13267(1995), both of which are expressly incorporated herein by reference).As outlined previously, explicit hydrogens are generated on the proteinbackbone structure.

In a preferred embodiment, the hydrogen bond potential consists of adistance-dependent term and an angle-dependent term, as shown inEquation 9: $\begin{matrix}{E_{H\text{-}{Bonding}} = {D_{0}\left\{ {{5\left( \frac{R_{0}}{R} \right)^{12}} - {6\left( \frac{R_{0}}{R} \right)^{10}}} \right\}\quad{F\left( {\theta,\phi,\varphi} \right)}}} & {{Equation}\quad 9}\end{matrix}$where R₀ (2.8 Å) and D₀ (8 kcal/mol) are the hydrogen-bond equilibriumdistance and well-depth, respectively, and R is the donor to acceptordistance. This hydrogen bond potential is based on the potential used inDREIDING with more restrictive angle-dependent terms to limit theoccurrence of unfavorable hydrogen bond geometries. The angle termvaries depending on the hybridization state of the donor and acceptor,as shown in Equations 10, 11, 12 and 13. Equation 10 is used for sp³donor to sp³ acceptor; Equation 11 is used for sp³ donor to sp²acceptor, Equation 12 is used for sp² donor to sp³ acceptor, andEquation 13 is used for sp² donor to sp² acceptor:F=cos² θ cos²(φ−109.5)  Equation 10F=cos² θ cos² φ  Equation 11F=cos⁴ θ  Equation 12F=cos² θ cos²(max [φ, Φ])  Equation 13

In Equations 10-13, θ is the donor-hydrogen-acceptor angle, Φ is thehydrogen-acceptor-base angle (the base is the atom attached to theacceptor, for example the carbonyl carbon is the base for a carbonyloxygen acceptor), and φ is the angle between the normals of the planesdefined by the six atoms attached to the sp² centers (the supplement ofΦ is used when Φ is less than 90°). The hydrogen-bond function is onlyevaluated when 2.6 Å≦R≦3.2 Å, θ>90°, φ−109.5°<90° for the sp³ donor-sp³acceptor case, and, φ>90° for the sp³ donor-sp² acceptor case;preferably, no switching functions are used. Template donors andacceptors that are involved in template-template hydrogen bonds arepreferably not included in the donor and acceptor lists. For the purposeof exclusion, a template-template hydrogen bond is considered to existwhen 2.5 Å≦R≦3.3 Å and θ≧135°.

The hydrogen-bond potential may also be combined or used with a weakcoulombic term that includes a distance-dependent dielectric constant of40R, where R is the interatomic distance. Partial atomic charges arepreferably only applied to polar functional groups. A net formal chargeof +1 is used for Arg and Lys and a net formal charge of −1 is used forAsp and Glu; see Gasteiger, et al., Tetrahedron 36:3219-3288 (1980);Rappe, et al., J. Phys. Chem. 95:3358-3363 (1991).

In a preferred embodiment, an explicit penalty is given for buried polarhydrogen atoms which are not hydrogen bonded to another atom. SeeEisenberg, et al., (1986) (supra), hereby expressly incorporated byreference. In a preferred embodiment, this penalty for polar hydrogenburial, is from about 0 to about 3 kcal/mol, with from about 1 to about3 being preferred and 2 kcal/mol being particularly preferred. Thispenalty is only applied to buried polar hydrogens not involved inhydrogen bonds. A hydrogen bond is considered to exist when E_(HB)ranges from about 1 to about 4 kcal/mol, with E_(HB) of less than −2kcal/mol being preferred. In addition, in a preferred embodiment, thepenalty is not applied to template hydrogens, i.e. unpaired buriedhydrogens of the backbone.

In a preferred embodiment, only hydrogen bonds between a first rotamerand the backbone are scored, and rotamer-rotamer hydrogen bonds are notscored. In an alternative embodiment, hydrogen bonds between a firstrotamer and the backbone are scored, and rotamer-rotamer hydrogen bondsare scaled by 0.5.

In a preferred embodiment, the hydrogen bonding scoring function is usedfor all positions, including core, surface and boundary positions. Inalternate embodiments, the hydrogen bonding scoring function may be usedon only one or two of these.

In a preferred embodiment, a secondary structure propensity scoringfunction is used. This is based on the specific amino acid side chain,and is conformation independent. That is, each amino acid has a certainpropensity to take on a secondary structure, either α-helix or β-sheet,based on its φ and ψ angles. See Muñoz et al., Current Op. in Biotech.6:382 (1995); Minor, et al., Nature 367:660-663 (1994); Padmanabhan, etal., Nature 344:268-270 (1990); Muñoz, et al., Folding & Design1(3):167-178 (1996); and Chakrabartty, et al., Protein Sci. 3:843(1994), all of which are expressly incorporated herein by reference.Thus, for variable residue positions that are in recognizable secondarystructure in the backbone, a secondary structure propensity scoringfunction is preferably used. That is, when a variable residue positionis in an α-helical area of the backbone, the α-helical propensityscoring function described below is calculated. Whether or not aposition is in a α-helical area of the backbone is determined as will beappreciated by those in the art, generally on the basis of φ and ψangles; for α-helix, φ angles from −2 to −70 and ω angles from −30 to−100 generally describe an α-helical area of the backbone.

Similarly, when a variable residue position is in a β-sheet backboneconformation, the β-sheet propensity scoring function is used. β-sheetbackbone conformation is generally described by φ angles from −30 to−100 and χ angles from +40 to +180. In alternate preferred embodiments,variable residue positions which are within areas of the backbone whichare not assignable to either β-sheet or α-helix structure may also besubjected to secondary structure propensity calculations.

In a preferred embodiment, energies associated with secondarypropensities are calculated using Equation 14:E _(∝)=10^(N) ^(ss) ^((ΔG°) ^(aa) ^(−ΔG°) ^(ala) ⁾−1  Equation 14

In Equation 14, E_(α) (or Eβ) is the energy of α-helical propensity,ΔG°_(aa) is the standard free energy of helix propagation of the aminoacid, and ΔG°_(ala) is the standard free energy of helix propagation ofalanine used as a standard, or standard free energy of β-sheet formationof the amino acid, both of which are available in the literature (seeChakrabartty, et al., (1994) (supra), and Munoz, et al., Folding &Design 1(3):167-178 (1996)), both of which are expressly incorporatedherein by reference), and N_(ss) is the propensity scale factor which isset to range from 1 to 4, with 3.0 being preferred. This potential ispreferably selected in order to scale the propensity energies to asimilar range as the other terms in the scoring function.

In a preferred embodiment, β-sheet propensities are preferablycalculated only where the i−1 and i+1 residues are also in β-sheetconformation.

In a preferred embodiment, the secondary structure propensity scoringfunction is used only in the energy calculations for surface variableresidue positions. In alternate embodiments, the secondary structurepropensity scoring function is used in the calculations for core andboundary regions as well.

In a preferred embodiment, an electrostatic scoring function is used, asshown below in Equation 15: $\begin{matrix}{\underset{{ɛ\quad r^{2}}\quad}{\underset{\_}{E_{elec} = {q\quad q^{\prime}}}}\quad} & {{Equation}\quad 15}\end{matrix}$

In this Equation, q is the charge on atom 1, q′ is charge on atom 2, andr is the interaction distance.

In a preferred embodiment, at least one scoring function is used foreach variable residue position; in preferred embodiments, two, three orfour scoring functions are used for each variable residue position.

Once the scoring functions to be used are identified for each variableposition, the preferred first step in the computational analysiscomprises the determination of the interaction of each possible rotamerwith all or part of the remainder of the protein. That is, the energy ofinteraction, as measured by one or more of the scoring functions, ofeach possible rotamer at each variable residue position with either thebackbone or other rotamers, is calculated. In a preferred embodiment,the interaction of each rotamer with the entire remainder of theprotein, i.e. both the entire template and all other rotamers, is done.However, as outlined above, it is possible to only model a portion of aprotein, for example a domain of a larger protein, and thus in somecases, not all of the protein need be considered.

In a preferred embodiment, the first step of the computationalprocessing is done by calculating two sets of interactions for eachrotamer at every position (step 70 of FIG. 3): the interaction of therotamer side chain with the template or backbone (the “singles” energy),and the interaction of the rotamer side chain with all other possiblerotamers at every other position (the “doubles” energy), whether thatposition is varied or floated. It should be understood that the backbonein this case includes both the atoms of the protein structure backbone,as well as the atoms of any fixed residues, wherein the fixed residuesare defined as a particular conformation of an amino acid.

Thus, “singles” (rotamer/template) energies are calculated for theinteraction of every possible rotamer at every variable residue positionwith the backbone, using some or all of the scoring functions. Thus, forthe hydrogen bonding scoring function, every hydrogen bonding atom ofthe rotamer and every hydrogen bonding atom of the backbone isevaluated, and the E_(HB) is calculated for each possible rotamer atevery variable position. Similarly, for the van der Waals scoringfunction, every atom of the rotamer is compared to every atom of thetemplate (generally excluding the backbone atoms of its own residue),and the E_(vdW) is calculated for each possible rotamer at everyvariable residue position. In addition, generally no van der Waalsenergy is calculated if the atoms are connected by three bonds or less.For the atomic solvation scoring function, the surface of the rotamer ismeasured against the surface of the template, and the E_(as) for eachpossible rotamer at every variable residue position is calculated. Thesecondary structure propensity scoring function is also considered as asingles energy, and thus the total singles energy may contain an E_(ss)term. As will be appreciated by those in the art, many of these energyterms will be close to zero, depending on the physical distance betweenthe rotamer and the template position; that is, the farther apart thetwo moieties, the lower the energy.

Accordingly, as outlined above, the total singles energy is the sum ofthe energy of each scoring function used at a particular position, asshown in Equation 1, wherein n is either 1 or zero, depending on whetherthat particular scoring function was used at the rotamer position:E _(total) =nE _(vdw) +nE _(as) +nE _(h-bonding) +nE _(ss) +nE_(elec)  Equation 1

Once calculated, each singles E_(total) for each possible rotamer isstored in the memory 24 within the computer, such that it may be used insubsequent calculations, as outlined below.

For the calculation of “doubles” energy (rotamer/rotamer), theinteraction energy of each possible rotamer is compared with everypossible rotamer at all other variable residue positions. Thus,“doubles” energies are calculated for the interaction of every possiblerotamer at every variable residue position with every possible rotamerat every other variable residue position, using some or all of thescoring functions. Thus, for the hydrogen bonding scoring function,every hydrogen bonding atom of the first rotamer and every hydrogenbonding atom of every possible second rotamer is evaluated, and theE_(HB) is calculated for each possible rotamer pair for any two variablepositions. Similarly, for the van der Waals scoring function, every atomof the first rotamer is compared to every atom of every possible secondrotamer, and the E_(vdW) is calculated for each possible rotamer pair atevery two variable residue positions. For the atomic solvation scoringfunction, the surface of the first rotamer is measured against thesurface of every possible second rotamer, and the E_(as) for eachpossible rotamer pair at every two variable residue positions iscalculated. The secondary structure propensity scoring function need notbe run as a “doubles” energy, as it is considered as a component of the“singles” energy. As will be appreciated by those in the art, many ofthese double energy terms will be close to zero, depending on thephysical distance between the first rotamer and the second rotamer; thatis, the farther apart the two moieties, the lower the energy.

Accordingly, as outlined above, the total doubles energy is the sum ofthe energy of each scoring function used to evaluate every possible pairof rotamers, as shown in Equation 16, wherein n is either 1 or zero,depending on whether that particular scoring function was used at therotamer position:E _(total) =nE _(vdw) +nE _(as) +nE _(h-bonding) +E _(elec)  Equation 16

An example is illuminating. A first variable position, i, has three (anunrealistically low number) possible rotamers (which may be either froma single amino acid or different amino acids) which are labelled ia, ib,and ic. A second variable position, j, also has three possible rotamers,labelled jd, je, and jf. Thus, nine doubles energies (E_(total)) arecalculated in all: E_(total)(ia, jd), E_(total)(ia, je), E_(total)(ia,jf), E_(total)(ib, jd), E_(total)(ib, je), E_(total)(ib, jf),E_(total)(ic, jd), E_(total)(ic, je), and E_(total)(ic, jf).

Once calculated, each doubles E_(total) for each possible rotamer pairis stored in memory 24 within the computer, such that it may be used insubsequent calculations, as outlined below.

Once the singles and doubles energies are calculated and stored, thenext step of the computational processing may occur. Generally speaking,the goal of the computational processing is to determine a set ofoptimized protein sequences. By “optimized protein sequence” herein ismeant a sequence that best fits the mathematical equations herein. Aswill be appreciated by those in the art, a global optimized sequence isthe one sequence that best fits Equation 1, i.e. the sequence that hasthe lowest energy of any possible sequence. However, there are anynumber of sequences that are not the global minimum but that have lowenergies.

In a preferred embodiment, the set comprises the globally optimalsequence in its optimal conformation, i.e. the optimum rotamer at eachvariable position. That is, computational processing is run until thesimulation program converges on a single sequence which is the globaloptimum.

In a preferred embodiment, the set comprises at least two optimizedprotein sequences. Thus for example, the computational processing stepmay eliminate a number of disfavored combinations but be stopped priorto convergence, providing a set of sequences of which the global optimumis one. In addition, further computational analysis, for example using adifferent method, may be run on the set, to further eliminate sequencesor rank them differently. Alternatively, as is more fully describedbelow, the global optimum may be reached, and then further computationalprocessing may occur, which generates additional optimized sequences inthe neighborhood of the global optimum.

If a set comprising more than one optimized protein sequences isgenerated, they may be rank ordered in terms of theoretical quantitativestability, as is more fully described below.

In a preferred embodiment, the computational processing step firstcomprises an elimination step, sometimes referred to as “applying acutoff”, either a singles elimination or a doubles elimination. Singleselimination comprises the elimination of all rotamers with templateinteraction energies of greater than about 10 kcal/mol prior to anycomputation, with elimination energies of greater than about 15 kcal/molbeing preferred and greater than about 25 kcal/mol being especiallypreferred. Similarly, doubles elimination is done when a rotamer hasinteraction energies greater than about 10 kcal/mol with all rotamers ata second residue position, with energies greater than about 15 beingpreferred and greater than about 25 kcal/mol being especially preferred.

In a preferred embodiment, the computational processing comprises directdetermination of total sequence energies, followed by comparison of thetotal sequence energies to ascertain the global optimum and rank orderthe other possible sequences, if desired. The energy of a total sequenceis shown below in Equation 17: $\begin{matrix}{E_{{total}\quad{protein}} = {E_{({b\text{-}b})} + {\sum\limits_{{all}\quad i}E_{({ia})}} + {\sum\limits_{{all}\quad i}{\sum\limits_{{+ \quad{all}}\quad j\quad{pairs}}E_{({{ia},{ja}})}}}}} & {{Equation}\quad 17}\end{matrix}$

Thus every possible combination of rotamers may be directly evaluated byadding the backbone-backbone (sometimes referred to herein astemplate-template) energy (E_((b-b)) which is constant over allsequences herein since the backbone is kept constant), the singlesenergy for each rotamer (which has already been calculated and stored),and the doubles energy for each rotamer pair (which has already beencalculated and stored). Each total sequence energy of each possiblerotamer sequence can then be ranked, either from best to worst or worstto best. This is obviously computationally expensive and becomesunwieldy as the length of the protein increases.

In a preferred embodiment, the computational processing includes one ormore Dead-End Elimination (DEE) computational steps. The DEE theorem isthe basis for a very fast discrete search program that was designed topack protein side chains on a fixed backbone with a known sequence. SeeDesmet, et al., Nature 356:539-542 (1992); Desmet, et al., The ProteinFolding Problem and Tertiary Structure Prediction, Ch. 10:1-49 (1994);Goldstein, Biophys. Jour. 66:1335-1340 (1994), all of which areincorporated herein by reference. DEE is based on the observation thatif a rotamer can be eliminated from consideration at a particularposition, i.e. make a determination that a particular rotamer isdefinitely not part of the global optimal conformation, the size of thesearch is reduced. This is done by comparing the worst interaction (i.e.energy or E_(total)) of a first rotamer at a single variable positionwith the best interaction of a second rotamer at the same variableposition. If the worst interaction of the first rotamer is still betterthan the best interaction of the second rotamer, then the second rotamercannot possibly be in the optimal conformation of the sequence. Theoriginal DEE theorem is shown in Equation 18: $\begin{matrix}\begin{matrix}{\left. {{E({ia})} + {\underset{j\quad}{\Sigma\left\lbrack \min \right.}\quad{over}\quad t\left\{ {E\left( {{ia},{jt}} \right)} \right\}}} \right\rbrack >} \\{{E({ib})}\underset{j\quad}{+}{\Sigma\left\lbrack {\max\quad{over}\quad t\left\{ {E\left( {{ib},{jt}} \right)} \right\}} \right\rbrack}}\end{matrix} & {{Equation}\quad 18}\end{matrix}$

In Equation 18, rotamer ia is being compared to rotamer ib. The leftside of the inequality is the best possible interaction energy(E_(total)) of ia with the rest of the protein; that is, “min over t”means find the rotamer t on position j that has the best interactionwith rotamer ia. Similarly, the right side of the inequality is theworst possible (max) interaction energy of rotamer ib with the rest ofthe protein. If this inequality is true, then rotamer ia is Dead-Endingand can be Eliminated. The speed of DEE comes from the fact that thetheorem only requires sums over the sequence length to test andeliminate rotamers.

In a preferred embodiment, a variation of DEE is performed. GoldsteinDEE, based on Goldstein, (1994) (supra), hereby expressly incorporatedby reference, is a variation of the DEE computation, as shown inEquation 19:E(ia)−E(ib)+Σ[min over t{E(ia, jt)−E(ib, jt)}]>0  Equation 19

In essence, the Goldstein Equation 19 says that a first rotamer a of aparticular position i (rotamer ia) will not contribute to a local energyminimum if the energy of conformation with ia can always be lowered byjust changing the rotamer at that position to ib, keeping the otherresidues equal. If this inequality is true, then rotamer ia isDead-Ending and can be Eliminated.

Thus, in a preferred embodiment, a first DEE computation is done whererotamers at a single variable position are compared, (“singles” DEE) toeliminate rotamers at a single position. This analysis is repeated forevery variable position, to eliminate as many single rotamers aspossible. In addition, every time a rotamer is eliminated fromconsideration through DEE, the minimum and maximum calculations ofEquation 18 or 19 change, depending on which DEE variation is used, thusconceivably allowing the elimination of further rotamers. Accordingly,the singles DEE computation can be repeated until no more rotamers canbe eliminated; that is, when the inequality is not longer true such thatall of them could conceivably be found on the global optimum.

In a preferred embodiment, “doubles” DEE is additionally done. Indoubles DEE, pairs of rotamers are evaluated; that is, a first rotamerat a first position and a second rotamer at a second position arecompared to a third rotamer at the first position and a fourth rotamerat the second position, either using original or Goldstein DEE. Pairsare then flagged as nonallowable, although single rotamers cannot beeliminated, only the pair. Again, as for singles DEE, every time arotamer pair is flagged as nonallowable, the minimum calculations ofEquation 18 or 19 change (depending on which DEE variation is used) thusconceivably allowing the flagging of further rotamer pairs. Accordingly,the doubles DEE computation can be repeated until no more rotamer pairscan be flagged; that is, where the energy of rotamer pairs overlap suchthat all of them could conceivably be found on the global optimum.

In addition, in a preferred embodiment, rotamer pairs are initiallyprescreened to eliminate rotamer pairs prior to DEE. This is done bydoing relatively computationally inexpensive calculations to eliminatecertain pairs up front. This may be done in several ways, as is outlinedbelow.

In a preferred embodiment, the rotamer pair with the lowest interactionenergy with the rest of the system is found. Inspection of the energydistributions in sample matrices has revealed that an i_(u)j_(v) pairthat dead-end eliminates a particular ids pair can also eliminate otheri_(r)j_(s) pairs. In fact, there are often a few i_(u)i_(v) pairs, whichwe call “magic bullets,” that eliminate a significant number ofi_(r)j_(s) pairs. We have found that one of the most potent magicbullets is the pair for which maximum interaction energy,t_(max)([i_(u)j_(v)])k_(t), is least. This pair is referred to as[i_(u)i_(v)]_(mb). If this rotamer pair is used in the first round ofdoubles DEE, it tends to eliminate pairs faster.

Our first speed enhancement is to evaluate the first-order doublescalculation for only the matrix elements in the row corresponding to the[i_(u)j_(v)]_(mb) pair. The discovery of [i_(u)j_(v)]_(mb) is an n²calculation (n=the number of rotamers per position), and the applicationof Equation 19 to the single row of the matrix corresponding to thisrotamer pair is another n² calculation, so the calculation time is smallin comparison to a full first-order doubles calculation. In practice,this calculation produces a large number of dead-ending pairs, oftenenough to proceed to the next iteration of singles elimination withoutany further searching of the doubles matrix.

The magic bullet first-order calculation will also discover alldead-ending pairs that would be discovered by the Equation 18 or 19,thereby making it unnecessary. This stems from the fact thatε_(max)([i_(u)j_(v)]_(mb)) must be less than or equal to anyε_(max)([i_(u)j_(v)]) that would successfully eliminate a pair by heEquation 18 or 19.

Since the minima and maxima of any given pair has been precalculated asoutlined herein, a second speed-enhancement precalculation may be done.By comparing extrema, pairs that will not dead end can be identified andthus skipped, reducing the time of the DEE calculation. Thus, pairs thatsatisfy either one of the following criteria are skipped:ε_(min)([i _(r) j _(s)])<ε_(min)([i _(u) j _(v)])  Equation 20ε_(max)([i _(r) j _(s)])<ε_(max)([i _(u) j _(v)])  Equation 21:

Because the matrix containing these calculations is symmetrical, half ofits elements will satisfy the first inequality Equation 20, and half ofthose remaining will satisfy the other inequality Equation 21. Thesethree quarters of the matrix need not be subjected to the evaluation ofEquation 18 or 19, resulting in a theoretical speed enhancement of afactor of four.

The last DEE speed enhancement refines the search of the remainingquarter of the matrix. This is done by constructing a metric from theprecomputed extrema to detect those matrix elements likely to result ina dead-ending pair.

A metric was found through analysis of matrices from different sampleoptimizations. We searched for combinations of the extrema thatpredicted the likelihood that a matrix element would produce adead-ending pair. Interval sizes (see FIG. 12) for each pair werecomputed from differences of the extrema. The size of the overlap of thei_(r)j_(s) and i_(u)j_(v) intervals were also computed, as well as thedifference between the minima and the difference between the maxima.Combinations of these quantities, as well as the lone extrema, weretested for their ability to predict the occurrence of dead-ending pairs.Because some of the maxima were very large, the quantities were alsocompared logarithmically.

Most of the combinations were able to predict dead-ending matrixelements to varying degrees. The best metrics were the fractionalinterval overlap with respect to each pair, referred to herein as q_(rs)and q_(uv). $\begin{matrix}{q_{rs} = {\frac{{interval}\quad{overlap}}{{interval}\left( \left\lbrack {i_{r}\quad j_{s}} \right\rbrack \right)} = \frac{{ɛ_{\max}\left( \left\lbrack {i_{u}\quad j_{v}} \right\rbrack \right)} - {ɛ_{\min}\left( \left\lbrack {i_{r}\quad j_{s}} \right\rbrack \right)}}{{ɛ_{\max}\left( \left\lbrack {i_{r}\quad j_{s}} \right\rbrack \right)} - {ɛ_{\min}\left( \left\lbrack {i_{r}\quad j_{s}} \right\rbrack \right)}}}} & {{Equation}\quad 22} \\{q_{uv} = {\frac{{interval}\quad{overlap}}{{interval}\left( \left\lbrack {i_{u}\quad j_{v}} \right\rbrack \right)} = \frac{{ɛ_{\max}\left( \left\lbrack {i_{u}\quad j_{v}} \right\rbrack \right)} - {ɛ_{\min}\left( \left\lbrack {i_{r}\quad j_{s}} \right\rbrack \right)}}{{ɛ_{\max}\left( \left\lbrack {i_{u}\quad j_{v}} \right\rbrack \right)} - {ɛ_{\min}\left( \left\lbrack {i_{u}\quad j_{v}} \right\rbrack \right)}}}} & {{Equation}\quad 23}\end{matrix}$

These values are calculated using the minima and maxima equations 24,25, 26 and 27 (see FIG. 14): $\begin{matrix}{{ɛ_{\max}\left( \left\lbrack {i_{r}j_{s}} \right\rbrack \right)} = {{ɛ\left( \left\lbrack {i_{r}j_{s}} \right\rbrack \right)} + {\sum\limits_{k \neq i \neq j}{\max\limits_{l}{ɛ\left( {\left\lbrack {i_{r}j_{s}} \right\rbrack,k} \right)}}}}} & {{Equation}\quad 24} \\{{ɛ_{\min}\left( \left\lbrack {i_{r}j_{s}} \right\rbrack \right)} = {{ɛ\left( \left\lbrack {i_{r}j_{s}} \right\rbrack \right)} + {\sum\limits_{k \neq i \neq j}{\min\limits_{t}{ɛ\left( {\left\lbrack {i_{r}j_{s}} \right\rbrack,k_{l}} \right)}}}}} & {{Equation}\quad 25} \\{{ɛ_{\max}\left( \left\lbrack {i_{u}j_{v}} \right\rbrack \right)} = {{ɛ\left( \left\lbrack {i_{u}j_{v}} \right\rbrack \right)} + {\sum\limits_{k \neq i \neq j}{\max\limits_{t}{ɛ\left( {\left\lbrack {i_{u}j_{v}} \right\rbrack,k_{t}} \right)}}}}} & {{Equation}\quad 26} \\{{ɛ_{\min}\left( \left\lbrack {i_{u}j_{v}} \right\rbrack \right)} = {{ɛ\left( \left\lbrack {i_{u}j_{v}} \right\rbrack \right)} + {\sum\limits_{k \neq i \neq j}{\min\limits_{t}{ɛ\left( {\left\lbrack {i_{u}j_{v}} \right\rbrack,k_{l}} \right)}}}}} & {{Equation}\quad 27}\end{matrix}$

These metrics were selected because they yield ratios of the occurrenceof dead-ending matrix elements to the total occurrence of elements thatare higher than any of the other metrics tested. For example, there arevery few matrix elements (˜2%) for which q_(rs)>0.98, yet these elementsproduce 30-40% of all of the dead-ending pairs.

Accordingly, the first-order doubles criterion is applied only to thosedoubles for which q_(rs)>0.98 and q_(uv)>0.99. The sample data analysespredict that by using these two metrics, as many as half of thedead-ending elements may be found by evaluating only two to five percentof the reduced matrix.

Generally, as is more fully described below, single and double DEE,using either or both of original DEE and Goldstein DEE, is run until nofurther elimination is possible. Usually, convergence is not complete,and further elimination must occur to achieve convergence. This isgenerally done using “super residue” DEE.

In a preferred embodiment, additional DEE computation is done by thecreation of “super residues” or “unification”, as is generally describedin Desmet, Nature 356:539-542 (1992); Desmet, et al, The Protein FoldingProblem and Tertiary Structure Prediction, Ch. 10:1-49 (1994);Goldstein, et al., supra. A super residue is a combination of two ormore variable residue positions which is then treated as a singleresidue position. The super residue is then evaluated in singles DEE,and doubles DEE, with either other residue positions or super residues.The disadvantage of super residues is that there are many more rotamericstates which must be evaluated; that is, if a first variable residueposition has 5 possible rotamers, and a second variable residue positionhas 4 possible rotamers, there are 20 possible super residue rotamerswhich must be evaluated. However, these super residues may be eliminatedsimilar to singles, rather than being flagged like pairs.

The selection of which positions to combine into super residues may bedone in a variety of ways. In general, random selection of positions forsuper residues results in inefficient elimination, but it can be done,although this is not preferred. In a preferred embodiment, the firstevaluation is the selection of positions for a super residue is thenumber of rotamers at the position. If the position has too manyrotamers, it is never unified into a super residue, as the computationbecomes too unwieldy. Thus, only positions with fewer than about 100,000rotamers are chosen, with less than about 50,000 being preferred andless than about 10,000 being especially preferred.

In a preferred embodiment, the evaluation of whether to form a superresidue is done as follows. All possible rotamer pairs are ranked usingEquation 28, and the rotamer pair with the highest number is chosen forunification: $\begin{matrix}\log^{\frac{{fraction}\quad{of}\quad{flagged}\quad{pairs}}{({{number}\quad{of}\quad{super}\quad{rotamers}\quad{resulting}\quad{from}\quad{the}\quad{potential}\quad{unification}})}} & {{Equation}\quad 28}\end{matrix}$

Equation 28 is looking for the pair of positions that has the highestfraction or percentage of flagged pairs but the fewest number of superrotamers. That is, the pair that gives the highest value for Equation 28is preferably chosen. Thus, if the pair of positions that has thehighest number of flagged pairs but also a very large number of superrotamers (that is, the number of rotamers at position i times the numberof rotamers at position j), this pair may not be chosen (although itcould) over a lower percentage of flagged pairs but fewer superrotamers.

In an alternate preferred embodiment, positions are chosen for superresidues that have the highest average energy; that is, for positions iand j, the average energy of all rotamers for i and all rotamers for jis calculated, and the pair with the highest average energy is chosen asa super residue.

Super residues are made one at a time, preferably. After a super residueis chosen, the singles and doubles DEE computations are repeated wherethe super residue is treated as if it were a regular residue. As forsingles and doubles DEE, the elimination of rotamers in the superresidue DEE will alter the minimum energy calculations of DEE. Thus,repeating singles and/or doubles DEE can result in further eliminationof rotamers.

FIG. 3 is a detailed illustration of the processing operationsassociated with a ranking module 34 of the invention. The calculationand storage of the singles and doubles energies 70 is the first step,although these may be recalculated every time. Step 72 is the optionalapplication of a cutoff, where singles or doubles energies that are toohigh are eliminated prior to further processing. Either or both oforiginal singles DEE 74 or Goldstein singles DEE 76 may be done, withthe elimination of original singles DEE 74 being generally preferred.Once the singles DEE is run, original doubles (78) and/or Goldsteindoubles (80) DEE is run. Super residue DEE is then generally run, eitheroriginal (82) or Goldstein (84) super residue DEE. This preferablyresults in convergence at a global optimum sequence. As is depicted inFIG. 3, after any step any or all of the previous steps can be rerun, inany order.

The addition of super residue DEE to the computational processing, withrepetition of the previous DEE steps, generally results in convergenceat the global optimum. Convergence to the global optimum is guaranteedif no cutoff applications are made, although generally a global optimumis achieved even with these steps. In a preferred embodiment, DEE is rununtil the global optimum sequence is found. That is, the set ofoptimized protein sequences contains a single member, the globaloptimum.

In a preferred embodiment, the various DEE steps are run until amanagable number of sequences is found, i.e. no further processing isrequired. These sequences represent a set of optimized proteinsequences, and they can be evaluated as is more fully described below.Generally, for computational purposes, a manageable number of sequencesdepends on the length of the sequence, but generally ranges from about 1to about 10¹⁵ possible rotamer sequences.

Alternatively, DEE is run to a point, resulting in a set of optimizedsequences (in this context, a set of remainder sequences) and thenfurther compututational processing of a different type may be run. Forexample, in one embodiment, direct calculation of sequence energy asoutlined above is done on the remainder possible sequences.Alternatively, a Monte Carlo search can be run.

In a preferred embodiment, the computation processing need not comprisea DEE computational step. In this embodiment, a Monte Carlo search isundertaken, as is known in the art. See Metropolis et al., J. Chem.Phys. 21:1087 (1953), hereby incorporated by reference. In thisembodiment, a random sequence comprising random rotamers is chosen as astart point. In one embodiment, the variable residue positions areclassified as core, boundary or surface residues and the set ofavailable residues at each position is thus defined. Then a randomsequence is generated, and a random rotamer for each amino acid ischosen. This serves as the starting sequence of the Monte Carlo search.A Monte Carlo search then makes a random jump at one position, either toa different rotamer of the same amino acid or a rotamer of a differentamino acid, and then a new sequence energy (E_(total sequence)) iscalculated, and if the new sequence energy meets the Boltzmann criteriafor acceptance, it is used as the starting point for another jump. Ifthe Boltzmann test fails, another random jump is attempted from theprevious sequence. In this way, sequences with lower and lower energiesare found, to generate a set of low energy sequences.

If computational processing results in a single global optimum sequence,it is frequently preferred to generate additional sequences in theenergy neighborhood of the global solution, which may be ranked. Theseadditional sequences are also optimized protein sequences. Thegeneration of additional optimized sequences is generally preferred soas to evaluate the differences between the theoretical and actualenergies of a sequence. Generally, in a preferred embodiment, the set ofsequences is at least about 75% homologous to each other, with at leastabout 80% homologous being preferred, at least about 85% homologousbeing particularly preferred, and at least about 90% being especiallypreferred. In some cases, homology as high as 95% to 98% is desirable.Homology in this context means sequence similarity or identity, withidentity being preferred. Identical in this context means identicalamino acids at corresponding positions in the two sequences which arebeing compared. Homology in this context includes amino acids which areidentical and those which are similar (functionally equivalent). Thishomology will be determined using standard techniques known in the art,such as the Best Fit sequence program described by Devereux, et al.,Nucl. Acid Res., 12:387-395 (1984), or the BLASTX program (Altschul, etal., J. Mol. Biol., 215:403-410 (1990)) preferably using the defaultsettings for either. The alignment may include the introduction of gapsin the sequences to be aligned. In addition, for sequences which containeither more or fewer amino acids than an optimum sequence, it isunderstood that the percentage of homology will be determined based onthe number of homologous amino acids in relation to the total number ofamino acids. Thus, for example, homology of sequences shorter than anoptimum will be determined using the number of amino acids in theshorter sequence.

Once optimized protein sequences are identified, the processing of FIG.2 optionally proceeds to step 56 which entails searching the proteinsequences. This processing may be implemented with the search module 36.The search module 36 is a set of computer code that executes a searchstrategy. For example, the search module 36 may be written to execute aMonte Carlo search as described above. Starting with the globalsolution, random positions are changed to other rotamers allowed at theparticular position, both rotamers from the same amino acid and rotamersfrom different amino acids. A new sequence energy (E_(total sequence))is calculated, and if the new sequence energy meets the Boltzmanncriteria for acceptance, it is used as the starting point for anotherjump. See Metropolis et al., 1953, supra, hereby incorporated byreference. If the Boltzmann test fails, another random jump is attemptedfrom the previous sequence. A list of the sequences and their energiesis maintained during the search. After a predetermined number of jumps,the best scoring sequences may be output as a rank-ordered list.Preferably, at least about 10⁶ jumps are made, with at least about 10⁷jumps being preferred and at least about 10⁸ jumps being particularlypreferred. Preferably, at least about 100 to 1000 sequences are saved,with at least about 10,000 sequences being preferred and at least about100,000 to 1,000,000 sequences being especially preferred. During thesearch, the temperature is preferably set to 1000 K.

Once the Monte Carlo search is over, all of the saved sequences arequenched by changing the temperature to 0 K, and fixing the amino acididentity at each position. Preferably, every possible rotamer jump forthat particular amino acid at every position is then tried.

The computational processing results in a set of optimized proteinsequences. These optimized protein sequences are generally, but notalways, significantly different from the wild-type sequence from whichthe backbone was taken. That is, each optimized protein sequencepreferably comprises at least about 5-10% variant amino acids from thestarting or wild-type sequence, with at least about 15-20% changes beingpreferred and at least about 30% changes being particularly preferred.

These sequences can be used in a number of ways. In a preferredembodiment, one, some or all of the optimized protein sequences areconstructed into designed proteins, as show with step 58 of FIG. 2.Thereafter, the protein sequences can be tested, as shown with step 60of the FIG. 2. Generally, this can be done in one of two ways.

In a preferred embodiment, the designed proteins are chemicallysynthesized as is known in the art. This is particularly useful when thedesigned proteins are short, preferably less than 150 amino acids inlength, with less than 100 amino acids being preferred, and less than 50amino acids being particularly preferred, although as is known in theart, longer proteins can be made chemically or enzymatically.

In a preferred embodiment, particularly for longer proteins or proteinsfor which large samples are desired, the optimized sequence is used tocreate a nucleic acid such as DNA which encodes the optimized sequenceand which can then be cloned into a host cell and expressed. Thus,nucleic acids, and particularly DNA, can be made which encodes eachoptimized protein sequence. This is done using well known procedures.The choice of codons, suitable expression vectors and suitable hostcells will vary depending on a number of factors, and can be easilyoptimized as needed.

Once made, the designed proteins are experimentally evaluated and testedfor structure, function and stability, as required. This will be done asis known in the art, and will depend in part on the original proteinfrom which the protein backbone structure was taken. Preferably, thedesigned proteins are more stable than the known protein that was usedas the starting point, although in some cases, if some constaints areplaced on the methods, the designed protein may be less stable. Thus,for example, it is possible to fix certain residues for alteredbiological activity and find the most stable sequence, but it may stillbe less stable than the wild type protein. Stable in this context meansthat the new protein retains either biological activity or conformationpast the point at which the parent molecule did. Stability includes, butis not limited to, thermal stability, i.e. an increase in thetemperature at which reversible or irreversible denaturing starts tooccur; proteolytic stability, i.e. a decrease in the amount of proteinwhich is irreversibly cleaved in the presence of a particular protease(including autolysis); stability to alterations in pH or oxidativeconditions; chelator stability; stability to metal ions; stability tosolvents such as organic solvents, surfactants, formulation chemicals;etc.

In a preferred embodiment, the modelled proteins are at least about 5%more stable than the original protein, with at least about 10% beingpreferred and at least about 20-50% being especially preferred.

The results of the testing operations may be computationally assessed,as shown with step 62 of FIG. 2. An assessment module 38 may be used inthis operation. That is, computer code may be prepared to analyze thetest data with respect to any number of metrices.

At this processing juncture, if the protein is selected (the yes branchat block 64) then the protein is utilized (step 66), as discussed below.If a protein is not selected, the accumulated information may be used toalter the ranking module 34, and/or step 56 is repeated and moresequences are searched.

In a preferred embodiment, the experimental results are used for designfeedback and design optimization.

Once made, the proteins of the invention find use in a wide variety ofapplications, as will be appreciated by those in the art, ranging fromindustrial to pharmacological uses, depending on the protein. Thus, forexample, proteins and enzymes exhibiting increased thermal stability maybe used in industrial processes that are frequently run at elevatedtemperatures, for example carbohydrate processing (includingsaccharification and liquefaction of starch to produce high fructosecorn syrup and other sweetners), protein processing (for example the useof proteases in laundry detergents, food processing, feed stockprocessing, baking, etc.), etc. Similarly, the methods of the presentinvention allow the generation of useful pharmaceutical proteins, suchas analogs of known proteinaceous drugs which are more thermostable,less proteolytically sensitive, or contain other desirable changes.

The following examples serve to more fully describe the manner of usingthe above-described invention, as well as to set forth the best modescontemplated for carrying out various aspects of the invention. It isunderstood that these examples in no way serve to limit the true scopeof this invention, but rather are presented for illustrative purposes.All references cited herein are explicitly incorporated by reference.

EXAMPLES Example 1

Protein Design Using van der Waals and Atomic Solvation ScoringFunctions with DEE

A cyclical design strategy was developed that couples theory,computation and experimental testing in order to address the problems ofspecificity and learning (FIG. 4). Our protein design automation (PDA)cycle is comprised of four components: a design paradigm, a simulationmodule, experimental testing and data analysis. The design paradigm isbased on the concept of inverse folding (Pabo, Nature 301:200 (1983);Bowie, et al., Science 253:164-170 (1991)) and consists of the use of afixed backbone onto which a sequence of side-chain rotamers can beplaced, where rotamers are the allowed conformations of amino acid sidechains (Ponder, et al., (1987) (supra)). Specific tertiary interactionsbased on the three dimensional juxtaposition of atoms are used todetermine the sequences that will potentially best adopt the targetfold. Given a backbone geometry and the possible rotamers allowed foreach residue position as input, the simulation must generate as output arank ordered list of solutions based on a cost function that explicitlyconsiders the atom positions in the various rotamers. The principleobstacle is that a fixed backbone comprised of n residues and m possiblerotamers per residue (all rotamers of all allowed amino acids) resultsin m^(n) possible arrangements of the system, an immense number for evensmall design problems. For example, to consider 50 rotamers at 15positions results in over 10²⁵ sequences, which at an evaluation rate of10⁹ sequences per second (far beyond current capabilities) would take10⁹ years to exhaustively search for the global minimum. The synthesisand characterization of a subset of amino acid sequences presented bythe simulation module generates experimental data for the analysismodule. The analysis section discovers correlations between calculableproperties of the simulated structures and the experimental observables.The goal of the analysis is to suggest quantitative modifications to thesimulation and in some cases to the guiding design paradigm. In otherwords, the cost function used in the simulation module describes atheoretical potential energy surface whose horizontal axis comprises allpossible solutions to the problem at hand. This potential energy surfaceis not guaranteed to match the actual potential energy surface which isdetermined from the experimental data. In this light, the goal of theanalysis becomes the correction of the simulation cost function in orderto create better agreement between the theoretical and actual potentialenergy surfaces. If such corrections can be found, then the output ofsubsequent simulations will be amino acid sequences that better achievethe target properties. This design cycle is generally applicable to anyprotein system and, by removing the subjective human component, allows alargely unbiased approach to protein design, i.e. protein designautomation.

The PDA side-chain selection algorithm requires as input a backbonestructure defining the desired fold. The task of designing a sequencethat takes this fold can be viewed as finding an optimal arrangement ofamino acid side chains relative to the given backbone. It is notsufficient to consider only the identity of an amino acid whenevaluating sequences. In order to correctly account for the geometricspecificity of side-chain placement, all possible conformations of eachside chain must also be examined. Statistical surveys of the proteinstructure database (Ponder, et al., supra) have defined a discrete setof allowed conformations, called rotamers, for each amino acid sidechain. We use a rotamer library based on the Ponder and Richards libraryto define allowed conformations for the side chains in PDA.

Using a rotamer description of side chains, an optimal sequence for abackbone can be found by screening all possible sequences of rotamers,where each backbone position can be occupied by each amino acid in allits possible rotameric states. The discrete nature of rotamer setsallows a simple calculation of the number of rotamer sequences to betested. A backbone of length n with m possible rotamers per positionwill have m^(n) possible rotamer sequences. The size of the search spacegrows exponentially with sequence length which for typical values of nand m render intractable an exhaustive search. This combinatorial“explosion” is the primary obstacle to be overcome in the simulationphase of PDA.

Simulation algorithm: An extension of the Dead End Elimination (DEE)theorem was developed (Desmet, et al., (1992 ((supra); Desmet, et al.,(1994) (supra); Goldstein, (1994) (supra) to solve the combinatorialsearch problem. The DEE theorem is the basis for a very fast discretesearch algorithm that was designed to pack protein side chains on afixed backbone with a known sequence. Side chains are described byrotamers and an atomistic forcefield is used to score rotamerarrangements. The DEE theorem guarantees that if the algorithmconverges, the global optimum packing is found. The DEE method isreadily extended to our inverse folding design paradigm by releasing theconstraint that a position is limited to the rotamers of a single aminoacid. This extension of DEE greatly increases the number of rotamers ateach position and requires a significantly modified implementation toensure convergence, described more fully herein. The guarantee that onlythe global optimum will be found is still valid, and in our extensionmeans that the globally optimal sequence is found in its optimalconformation.

DEE was implemented with a novel addition to the improvements suggestedby Goldstein (Goldstein, (1994) (supra)). As has been noted, exhaustiveapplication of the R=1 rotamer elimination and R=0 rotamer-pair flaggingequations and limited application of the R=1 rotamer-pair flaggingequation routinely fails to find the global solution. This problem canbe overcome by unifying residues into “super residues” (Desmet, et al.,(1992 ((supra); Desmet, et al., (1994) (supra); Goldstein, (1994)(supra). However, unification can cause an unmanageable increase in thenumber of super rotamers per super residue position and can lead tointractably slow performance since the computation time for applying theR=1 rotamer-pair flagging equation increases as the fourth power of thenumber of rotamers. These problems are of particular importance forprotein design applications given the requirement for large numbers ofrotamers per residue position. In order to limit memory size and toincrease performance, we developed a heuristic that governs whichresidues (or super residues) get unified and the number of rotamer (orsuper rotamer) pairs that are included in the R=1 rotamer-pair flaggingequation. A program called PDA_DEE was written that takes a list ofrotamer energies from PDA_SETUP and outputs the global minimum sequencein its optimal conformation with its energy.

Scoring functions: The rotamer library used was similar to that used byDesmet and coworkers (Desmet, et al., (1992) (supra)). χ₁ and χ₂ anglevalues of rotamers for all amino acids except Met, Arg and Lys wereexpanded plus and minus one standard deviation about the mean value fromthe Ponder and Richards library (supra) in order to minimize possibleerrors that might arise from the discreteness of the library. c₃ and c₄angles that were undetermined from the database statistics were assignedvalues of 0° and 180° for Gln and 60°, −60° and 180° for Met, Lys andArg. The number of rotamers per amino acid is: Gly, 1; Ala, 1; Val, 9;Ser, 9; Cys, 9; Thr, 9; Leu, 36; Ile, 45; Phe, 36; Tyr, 36; Trp, 54;His, 54; Asp, 27; Asn, 54; Glu, 69; Gln, 90; Met, 21; Lys, 57; Arg, 55.The cyclic amino acid Pro was not included in the library. Further, allrotamers in the library contained explicit hydrogen atoms. Rotamers werebuilt with bond lengths and angles from the Dreiding forcefield (Mayo,et al., J. Phys. Chem. 94:8897 (1990)).

The initial scoring function for sequence arrangements used in thesearch was an atomic van der Waals potential. The van der Waalspotential reflects excluded volume and steric packing interactions whichare important determinants of the specific three dimensional arrangementof protein side chains. A Lennard-Jones 12-6 potential with radii andwell depth parameters from the Dreiding forcefield was used for van derWaals interactions. Non-bonded interactions for atoms connected by oneor two bonds were not considered. van der Waals radii for atomsconnected by three bonds were scaled by 0.5. Rotamer/rotamer pairenergies and rotamer/template energies were calculated in a mannerconsistent with the published DEE algorithm (Desmet, et al., (1992)(supra)). The template consisted of the protein backbone and the sidechains of residue positions not to be optimized. No intra-side-chainpotentials were calculated. This scheme scored the packing geometry andeliminated bias from rotamer internal energies. Prior to DEE, allrotamers with template interaction energies greater than 25 kcal/molwere eliminated. Also, any rotamer whose interaction was greater than 25kcal/mol with all other rotamers at another residue position waseliminated. A program called PDA_SETUP was written that takes as inputbackbone coordinates, including side chains for positions not optimized,a rotamer library, a list of positions to be optimized and a list of theamino acids to be considered at each position. PDA_SETUP outputs a listof rotamer/template and rotamer/rotamer energies.

The pairwise solvation potential was implemented in two components toremain consistent with the DEE methodology: rotamer/template androtamer/rotamer burial. For the rotamer/template buried area, thereference state was defined as the rotamer in question at residue i withthe backbone atoms only of residues i−1, i and i+1. The area of the sidechain was calculated with the backbone atoms excluding solvent but notcounted in the area. The folded state was defined as the area of therotamer in question at residue i, but now in the context of the entiretemplate structure including non-optimized side chains. Therotamer/template buried area is the difference between the reference andthe folded states. The rotamer/rotamer reference area is simply the sumof the areas of the isolated rotamers. The folded state is the area ofthe two rotamers placed in their relative positions on the proteinscaffold but with no template atoms present. The Richards definition ofsolvent accessible surface area (Lee & Richards, 1971, supra) was used,with a probe radius of 1.4 Å and Drieding van der Waals radii. Carbonand sulfur, and all attached hydrogens, were considered nonpolar.Nitrogen and oxygen, and all attached hydrogens, were considered polar.Surface areas were calculated with the Connolly algorithm using a dotdensity of 10 Å⁻² (Connolly, (1983) (supra)). In more recentimplementations of PDA_SETUP, the MSEED algorithm of Scheraga has beenused in conjunction with the Connolly algorithm to speed up thecalculation (Perrot, et al., J. Comput. Chem. 13:1-11 (1992).

Monte Carlo search: Following DEE optimization, a rank ordered list ofsequences was generated by a Monte Carlo search in the neighborhood ofthe DEE solution. This list of sequences was necessary because ofpossible differences between the theoretical and actual potentialsurfaces. The Monte Carlo search starts at the global minimum sequencefound by DEE. A residue was picked randomly and changed to a randomrotamer selected from those allowed at that site. A new sequence energywas calculated and, if it met the Boltzman criteria for acceptance, thenew sequence was used as the starting point for another jump. If theBoltzman test failed, then another random jump was attempted from theprevious sequence. A list of the best sequences found and their energieswas maintained throughout the search. Typically 10⁶ jumps were made, 100sequences saved and the temperature was set to 1000 K. After the searchwas over, all of the saved sequences were quenched by changing thetemperature to 0 K, fixing the amino acid identity and trying everypossible rotamer jump at every position. The search was implemented in aprogram called PDA_MONTE whose input was a global optimum solution fromPDA_DEE and a list of rotamer energies from PDA_SETUP. The output was alist of the best sequences rank ordered by their score. PDA_SETUP,PDA_DEE and PDA_MONTE were implemented in the CERIUS2 softwaredevelopment environment (Biosym/Molecular Simulations, San Diego,Calif.).

PDA_SETUP, PDA_DEE, and PDA_MONTE were implemented in the CERIUS2software development environment (Biosym/Molecular Simulations, SanDiego, Calif.).

Model system and experimental testing: The homodimeric coiled coil of αhelices was selected as the initial design target. Coiled coils arereadily synthesized by solid phase techniques and their helicalsecondary structure and dimeric tertiary organization easecharacterization. Their sequences display a seven residue periodic HPpattern called a heptad repeat, (a•b•c•d•e•f•g) (Cohen & Parry, ProteinsStruc. Func. Genet. 7:1-15 (1990)). The a and d positions are usuallyhydrophobic and buried at the dimer interface while the other positionsare usually polar and solvent exposed (FIG. 5). The backbone needed forinput to the simulation module was taken from the crystal structure ofGCN4-p1 (O'Shea, et al., Science 254:539 (1991)). The 16 hydrophobic aand d positions were optimized in the crystallographically determinedfixed field of the rest of the protein. Homodimer sequence symmetry wasenforced, only rotamers from hydrophobic amino acids (A, V, L, I, M, F,Y and W) were considered and the asparagine at an a position, Asn 16,was not optimized.

Homodimeric coiled coils were modeled on the backbone coordinates ofGCN4-p1, PDB ascension code 2ZTA (Bernstein, et al., J. Mol. Biol.112:535 (1977); O'Shea, et al., supra). Atoms of all side chains notoptimized were left in their crystallographically determined positions.The program BIOGRAF (Biosym/Molecular Simulations, San Diego, Calif.)was used to generate explicit hydrogens on the structure which was thenconjugate gradient minimized for 50 steps using the Dreiding forcefield.The HP pattern was enforced by only allowing hydrophobic amino acidsinto the rotamer groups for the optimized a and d positions. Thehydrophobic group consisted of Ala, Val, Leu, Ile, Met, Phe, Tyr and Trpfor a total of 238 rotamers per position. Homodimer symmetry wasenforced by penalizing by 100 kcal/mol rotamer pairs that violatesequence symmetry. Different rotamers of the same amino acid wereallowed at symmetry related positions. The asparagine that occupies thea position at residue 16 was left in the template and not optimized. A10⁶ step Monte Carlo search run at a temperature of 1000 K generated thelist of candidate sequences rank ordered by their score. To testreproducibility, the search was repeated three times with differentrandom number seeds and all trials provided essentially identicalresults. The Monte Carlo searches took about 90 minutes. Allcalculations in this work were performed on a Silicon Graphics 200 MHzR4400 processor.

Optimizing the 16 a and d positions each with 238 possible hydrophobicrotamers results in 238¹⁶ or 10³⁸ rotamer sequences. The DEE algorithmfinds the global optimum in three minutes, including rotamer energycalculation time. The DEE solution matches the naturally occurringGCN4-p1 sequence of a and d residues for all of the 16 positions. A onemillion step Monte Carlo search run at a temperature of 1000 K generatedthe list of sequences rank ordered by their score. To testreproducibility, the search was repeated three times with differentrandom number seeds and all trials provided essentially identicalresults. The second best sequence is a Val 30 to Ala mutation and liesthree kcal/mol above the ground state sequence. Within the top 15sequences up to six mutations from the ground state sequence aretolerated, indicating that a variety of packing arrangements areavailable even for a small coiled coil. Eight sequences with a range ofstabilities were selected for experimental testing, including six fromthe top 15 and two more about 15 kcal/mol higher in energy, the 56th and70th in the list (Table 1) (SEQ ID NO:23 to SEQ ID NO:30). TABLE 1 NameSequence Rank Energy PDA-3H^(b) (SEQ ID NO:23) RMKQLEDKVEELLSKN 1 −118.1YHLENEVARLKKLVGE R PDA-3A (SEQ ID NO:24) RMKQLEDKVEELLSKN 2 −115.3YHLENEVARLKKLAGE R PDA-3G (SEQ ID NO:25) RMKQLEDKVEELLSKN 5 −112.8YHLENEMARLKKLVGE R PDA-3B (SEQ ID NO:26) RLKQMEDKVEELLSKN 6 −112.6YHLENEVARLKKLVGE R PDA-3D (SEQ ID NO:27) RLKQMEDKVEELLSKN 13 −109.7YHLENEVARLKKLAGE R PDA-3C (SEQ ID NO:28) RMKQWEDKAEELLSKN 14 −109.6YHLENEVARLKKLVGE R PDA-3F (SEQ ID NO:29) RMKQFEDKVEELLSKN 56 −103.9YHLENEVARLKKLVGE R PDA-3E (SEQ ID NO:30) RMKQLEDKVEELLSKN 70 −103.1YHAENEVARLKKLVGE R

Thirty-three residue peptides were synthesized on an Applied BiosystemsModel 433A peptide synthesizer using Fmoc chemistry, HBTU activation anda modified Rink amide resin from Novabiochem. Standard 0.1 mmol couplingcycles were used and amino termini were acetylated. Peptides werecleaved from the resin by treating approximately 200 mg of resin with 2mL trifluoroacetic acid (TFA) and 100 μL water, 100 μL thioanisole, 50μL ethanedithiol and 150 mg phenol as scavengers. The peptides wereisolated and purified by precipitation and repeated washing with coldmethyl tert-butyl ether followed by reverse phase HPLC on a Vydac C8column (25 cm by 22 mm) with a linear acetonitrile-water gradientcontaining 0.1% TFA. Peptides were then lyophilized and stored at −20°C. until use. Plasma desorption mass spectrometry found all molecularweights to be within one unit of the expected masses.

Circular dichroism CD spectra were measured on an Aviv 62DS spectrometerat pH 7.0 in 50 mM phosphate, 150 mM NaCl and 40 μM peptide. A 1 mmpathlength cell was used and the temperature was controlled by athermoelectric unit. Thermal melts were performed in the same bufferusing two degree temperature increments with an averaging time of 10 sand an equilibration time of 90 s. T_(m) values were derived from theellipticity at 222 nm ([θ]₂₂₂) by evaluating the minimum of thed[θ]₂₂₂/dT⁻¹ versus T plot (Cantor & Schimmel, Biophysical Chemistry.New York: W. H. Freemant and Company, 1980). The T_(m)'s werereproducible to within one degree. Peptide concentrations weredetermined from the tyrosine absorbance at 275 nm (Huyghues-Despointes,et al., supra).

Size exclusion chromatography: Size exclusion chromatography wasperformed with a Synchropak GPC 100 column (25 cm by 4.6 mm) at pH 7.0in 50 mM phosphate and 150 mM NaCl at 0° C. GCN4-p1 and p-LI (Harbury,et al., Science 262:1401 (1993)) were used as size standards. 10 μlinjections of 1 mM peptide solution were chromatographed at 0.20 ml/minand monitored at 275 nm. Peptide concentrations were approximately 60 μMas estimated from peak heights. Samples were run in triplicate.

The designed a and d sequences were synthesized as above using theGCN4-p1 sequence for the b•c and e•f•g positions. Standard solid phasetechniques were used and following HPLC purification, the identities ofthe peptides were confirmed by mass spectrometry. Circular dichroismspectroscopy (CD) was used to assay the secondary structure and thermalstability of the designed peptides. The CD spectra of all the peptidesat 1° C. and a concentration of 40 mM exhibit minima at 208 and 222 nmand a maximum at 195 nm, which are diagnostic for α helices (data notshown). The ellipticity values at 222 nm indicate that all of thepeptides are >85% helical (approximately −28000 deg cm²/dmol), with theexception of PDA-3C (SEQ ID NO:28) which is 75% helical at 40 mM butincreases to 90% helical at 170 mM (Table 2). TABLE 2 CD data andcalculated structural properties of the PDA peptides. E_(MC) E_(CQ)E_(CG) E_(vdW) -[θ]₂₂₂ (deg T_(m) (kcal/ ΔA_(np) ΔA_(p) Vol Rot (kcal/(kcal/ (kcal/ Name cm²/dmol) (° C.) mol) (Å²) (Å²) (Å³) bonds mol) mol)mol) Npb Pb PDA-3H 33000 57 −118.1 2967 2341 1830 28 −234 −308 409 207128 (SEQ ID NO: 23) PDA-3A 30300 48 −115.3 2910 2361 1725 26 −232 −312400 203 128 (SEQ ID NO: 24) PDA-3B 28200 47 −112.6 2977 2372 1830 28−242 −306 379 210 127 (SEQ ID NO: 26) PDA-3G 30700 47 −112.8 3003 23831878 32 −240 −309 439 212 128 (SEQ ID NO: 25) PDA-3F 28800 39 −103.93000 2336 1872 28 −188 −302 420 212 128 (SEQ ID NO: 29) PDA-3D 27800 39−109.7 2920 2392 1725 26 −240 −310 370 206 127 (SEQ ID NO: 27) PDA-3C24100 26 −109.6 2878 2400 1843 26 −149 −304 398 215 129 (SEQ ID NO: 28)PDA-3E 27500 24 −103.1 2882 2361 1674 24 −179 −309 411 203 127 (SEQ IDNO: 30)*E_(MC) is the Monte Carlo energy;ΔA_(np) and ΔA_(p) are the changes in solvent accessible non-polar andpolar surface areas upon folding, respectively;E_(CQ) is the electrostatic energy using equilibrated charges;E_(CG) is the electrostatic energy using Gasteiger charges;E_(vdW) is the van der Waals energy;Vol is the side chain van der Waals volume;Rot bonds is the number of side chain rotatable bonds (excluding methylrotors);Npb and Pb are the number of buried non-polar and polar atoms,respectively.

The melting temperatures (T_(m)'s) show a broad range of values (datanot shown), with 6 of the 8 peptides melting at greater thanphysiological temperature. Also, the T_(m)'s were not correlated to thenumber of sequence differences from GCN4-p1. Single amino acid changesresulted in some of the most and least stable peptides, demonstratingthe importance of specificity in sequence selection.

Size exclusion chromatography confirmed the dimeric nature of thesedesigned peptides. Using coiled coil peptides of known oligomerizationstate as standards, the PDA peptides (SEQ ID NO:23 to SEQ ID NO:30)migrated as dimers. This result is consistent with the appearance ofβ-branched residues at a positions and leucines at d positions, whichhave been shown previously to favor dimerization over other possibleoligomerization states (Harbury, et al., supra).

The characterization of the PDA peptides (SEQ ID NO:23 to SEQ ID NO:30)demonstrates the successful design of several stable dimeric helicalcoiled coils. The sequences were automatically generated in the contextof the design paradigm by the simulation module using well-definedinputs that explicitly consider the HP patterning and steric specificityof protein structure. Two dimensional nuclear magnetic resonanceexperiments aimed at probing the specificity of the tertiary packing arethe focus of further studies on these peptides. Initial experiments showsignificant protection of amide protons from chemical exchange andchemical shift dispersion comparable to GCN4-p1 (unpublished results)(Oas, et al., Biochemistry 29:2891 (1990)); Goodman & Kim, Biochem.30:11615 (1991)).

Data analysis and design feedback A detailed analysis of thecorrespondence between the theoretical and experimental potentialsurfaces, and hence an estimate of the accuracy of the simulation costfunction, was enabled by the collection of experimental data. Usingthermal stability as a measure of design performance, meltingtemperatures of the PDA peptides (SEQ ID NO:23 to SEQ ID NO:30) wereplotted against the sequence scores found in the Monte Carlo search(FIG. 6). The modest correlation, 0.67, in the plot shows that while anexclusively van der Waals scoring function can screen for stablesequences, it does not accurately predict relative stabilities. In orderto address this issue, correlations between calculated structuralproperties and T_(m)'s were systematically examined using quantitativestructure activity relationships (QSAR), which is a statisticaltechnique commonly used in structure based drug design (Hopfinger, J.Med. Chem. 28:1133 (1985)).

Table 2 lists various molecular properties of the PDA peptides (SEQ IDNO:23 to SEQ ID NO:30) in addition to the van der Waals based MonteCarlo scores and the experimentally determined T_(m)'s. A wide range ofproperties was examined, including molecular mechanics components, suchas electrostatic energies, and geometric measures, such as volume. Thegoal of QSAR is the generation of equations that closely approximate theexperimental quantity, in this case T_(m), as a function of thecalculated properties. Such equations suggest which properties can beused in an improved cost function. The PDA analysis module employsgenetic function approximation (GFA) (Rogers & Hopfinger, J. Chem. Inf.Comput. Scie. 34:854 (1994)), a novel method to optimize QSAR equationsthat selects which properties are to be included and the relativeweightings of the properties using a genetic algorithm. GFA accomplishesan efficient search of the space of possible equations and robustlygenerates a list of equations ranked by their correlation to the data.

Equations are scored by lack of fit (LOF), a weighted least square errormeasure that resists overfitting by penalizing equations with more terms(Rogers & Hopfinger, supra). GFA optimizes both the length and thecomposition of the equations and, by generating a set of QSAR equations,clarifies combinations of properties that fit well and properties thatrecur in many equations. All of the top five equations that correct thesimulation energy (E_(MC)) contain burial of nonpolar surface area,ΔA_(np) (Table 3). TABLE 3 Top five QSAR equations generated by GFA withLOF, correlation coefficient and cross validation scores. QSAR equationLOF r² CV r² −1.44 * E_(MC) + 0.14 * ΔA_(np) − 0.73 * Npb 16.23 .98 .78−1.78 * E_(MC) + 0.20 * ΔA_(np) − 2.43 * Rot 23.13 .97 .75 −1.59 *E_(MC) + 0.17 * ΔA_(np) − 0.05 * Vol 24.57 .97 .36 −1.54 * E_(MC) +0.11 * ΔA_(np) 25.45 .91 .80 −1.60 * E_(MC) + 0.09 * ΔA_(np) − 0.12 *ΔA_(p) 33.88 .96 .90ΔA_(np) and ΔA_(p) are nonpolar and polar surface buried upon folding,respectively.Vol is side chain volume,Npb is the number of buried nonpolar atoms andRot is the number of buried rotatable bonds.

The presence of ΔA_(np) in all of the top equations, in addition to thelow LOF of the QSAR containing only E_(MC) and ΔA_(np), stronglyimplicates nonpolar surface burial as a critical property for predictingpeptide stability. This conclusion is not surprising given the role ofthe hydrophobic effect in protein energetics (Dill, Biochem. 29:7133(1990)).

Properties were calculated using BIOGRAF and the Dreiding forcefield.Solvent accessible surface areas were calculated with the Connollyalgorithm (Connolly, (1983) (supra)) using a probe radius of 1.4 Å and adot density of 10 Å⁻². Volumes were calculated as the sum of the van derWaals volumes of the side chains that were optimized. The number ofburied polar and nonpolar heavy atoms were defined as atoms, with theirattached hydrogens, that expose less that 5 Å² in the surface areacalculation. Electrostatic energies were calculated using a dielectricof one and no cutoff was set for calculation on non-bonded energies.Charge equilibration charges (Rappe & Goddard III, J. Phys. Chem.95:3358 (1991) and Gasteiger (Gasteiger & Marsili, Tetrahedron 36:3219(1980) charges were used to generate electrostatic energies. Chargeequilibration charges were manually adjusted to provide neutralbackbones and neutral side chains in order to prevent spurious monopoleeffects. The selection of properties was limited by the requirement thatproperties could not be highly correlated. Correlated properties cannotbe differentiated by QSAR techniques and only create redundancy in thederived relations.

Genetic function approximation (GFA) was preformed in the CERIUS2simulation package version 1.6 (Biosym/Molecular Simulations, San Diego,Calif.). An initial population of 300 equations was generated consistingof random combinations of three properties. Only linear terms were usedand initial coefficients were determined by least squares regression foreach set of properties. Redundant equations were eliminated and 10000generated of random crossover mutations were performed. If a child hadbetter score than the worst equation in the population, the childreplaced the worse equation. Also, mutation operators that add or removeterms had a 50% probability of being applied each generation, but thesemutations were only accepted if the score was improved. No equation withgreater than three terms was allowed. Equations were scored duringevolution using the lack of fit (LOF) parameter, a scaled least squareerror (LSE) measure that penalized equations with more terms and henceresists overfitting. LOF is defined as:${LOF} = \frac{LSE}{\left( {1 - \frac{2C}{M}} \right)^{2^{\prime}}}$where c is the number of terms in the equation and M is the number ofdata points. Five different randomized runs were done and the finalequation populations were pooled. Only equations containing thesimulation energy, E_(MC), were considered which resulted in 108equations ranked by their LOF.

To assess the predictive power of these QSAR equations, as well as theirrobustness, cross validation analysis was carried out. Each peptide wassequentially removed from the data set and the coefficients of theequation in question were refit. This new equation was then used topredict the withheld data point. When all of the data points had beenpredicted in this manner, their correlation to the measured T_(m)'s wascomputed (Table 3). Only the E_(MC)/ΔA_(np) QSAR and theE_(MC)/ΔA_(np)/ΔA_(p) QSAR performed well in cross validation. TheE_(MC)/ΔA_(np) equation could not be expected to fit the data assmoothly as QSAR's with three terms and hence had a lower crossvalidated r². However, all other two term QSAR's had LOF scores greaterthan 48 and cross validation correlations less than 0.55 (data notshown). The QSAR analysis independently predicted with no subjectivebias that consideration of nonpolar and polar surface area burial isnecessary to improve the simulation. This result is consistent withprevious studies on atomic solvation potentials (Eisenberg, et al.,(1986) (supra); Wesson, et al., Protein Sci. 1:227 (1992)). Further,simpler structural measures, such as number of buried atoms, thatreflect underlying principles such as hydrophobic solvation (Chan, etal., Science 267:1463 (1995)) were not deemed as significant by the QSARanalysis. These results justify the cost of calculating actual surfaceareas, though in some studies simpler potentials have been shown toperform well (van Gunsteren, et al. J. Mol. Biol 227:389 (1992)).

Δa_(np) and ΔA_(p) were introduced into the simulation module to correctthe cost function. Contributions to surface burial from rotamer/templateand rotamer/rotamer contacts were calculated and used in the interactionpotential. Independently counting buried surface from different rotamerpairs, which is necessary in DEE, leads to overestimation of burialbecause the radii of solvent accessible surfaces are much larger thanthe van der Waals contact radii and hence can overlap greatly in a closepacked protein core. To account for this discrepancy, the areas used inthe QSAR were recalculated using the pairwise area method and a newE_(MC)/ΔA_(np)/ΔA_(p) QSAR equation was generated. The ratios of theE_(MC) coefficient to the ΔA_(np) and ΔA_(p) coefficients are scalefactors that are used in the simulation module to convert buried surfacearea into energy, i.e. atomic solvation parameters. Thermal stabilitiesare predicted well by this cost function (FIG. 6B). In addition, theimproved cost function still predicts the naturally occurring GCN4-p1sequence as the ground state. The surface area to energy scale factors,16 cal/mol/Å² favoring nonpolar area burial and 86 cal/mol/Å² opposingpolar area burial, are similar in sign, scale and relative magnitude tosolvation potential parameters derived from small molecule transfer data(Wesson & Eisenberg, supra).

λ repressor mutants: To demonstrate the generality of the cost function,other proteins were examined using the simulation module. A library ofcore mutants of the DNA binding protein λ repressor has been extensivelycharacterized by Sauer and coworkers (Lim & Sauer, J. Mol. Biol. 219:359(1991)). Template coordinates were taken from PDB file 1LMB (Beamer &Pabo, J. Mol. Biol. 227:177 (1992)). The subunit designated chain 4 inthe PDB file was removed from the context of the rest of the structure(accompanying subunit and DNA) and using BIOGRAF explicit hydrogens wereadded. The hydrophobic residues with side chains within 5 Å of the threemutation sites (V36 M40 V47) are Y22, L31, A37, M42, L50, F51, L64, L65,168 and L69. All of these residues are greater than 80% buried exceptfor M42 which is 65% buried and L64 which is 45% buried. A37 only hasone possible rotamer and hence was not optimized. The other nineresidues in the 5 Å sphere were allowed to take any rotamer conformationof their amino acid (“floated”). The mutation sites were allowed anyrotamer of the amino acid sequence in question. Depending on the mutantsequence, 5×10¹⁶ to 7×10¹⁸ conformations were possible. Rotamer energyand DEE calculation times were 2 to 4 minutes. The combined activityscore is that of Hellinga and Richards (Hellinga, et al., (1994)(supra)). Seventy-eight of the 125 possible combinations were generated.Also, this dataset has been used to test several computational schemesand can serve as a basis for comparing different forcefields (Lee &Levitt, Nature 352:448 (1991); van Gunsteren & Mark, supra; Hellinga, etal., (1994) (supra)). The simulation module, using the cost functionfound by QSAR, was used to find the optimal conformation and energy foreach mutant sequence. All hydrophobic residues within 5 Å of the threemutation sites were also left free to be relaxed by the algorithm. This5 Å sphere contained 12 residues, a significantly larger problem thanprevious efforts (Lee & Levitt, supra; Hellinga, (1994) (supra)), thatwere rapidly optimized by the DEE component of the simulation module.The rank correlation of the predicted energy to the combined activityscore proposed by Hellinga and Richards is shown in FIG. 7. The wildtypehas the lowest energy of the 125 possible sequences and the correlationis essentially equivalent to previously published results whichdemonstrates that the QSAR corrected cost function is not specific forcoiled coils and can model other proteins adequately.

Example 2

Automated Design of the Surface Positions of Protein Helices

GCN4-pl, a homodimeric coiled coil, was again selected as the modelsystem because it can be readily synthesized by solid phase techniquesand its helical secondary structure and dimeric tertiary organizationease characterization. The sequences of homodimeric coiled coils displaya seven residue periodic hydrophobic and polar pattern called a heptadrepeat, (a•b•c•d•e•f•g) (Cohen & Parry, supra). The a and d positionsare buried at the dimer interface and are usually hydrophobic, whereasthe b, c, e, f, and g positions are solvent exposed and usually polar(FIG. 5). Examination of the crystal structure of GCN4-p1 (O'Shea, etal., supra) shows that the b, c, and f side chains extend into solventand expose at least 55% of their surface area. In contrast, the e and gresidues bury from 50 to 90% of their surface area by packing againstthe a and d residues of the opposing helix. We selected the 12 b, c, andf residue positions for surface sequence design: positions 3, 4, 7, 10,11, 14, 17, 18, 21, 24, 25, and 28 using the numbering from PDB entry2zta (Bernstein, et al., J. Mol. Biol. 112:535 (1977)). The remainder ofthe protein structure, including all other side chains and the backbone,was used as the template for sequence selection calculations. Thesymmetry of the dimer and lack of interactions of surface residuesbetween the subunits allowed independent design of each subunit, therebysignificantly reducing the size of the sequence optimization problem.

All possible sequences of hydrophilic amino acids (D, E, N, Q, K, R, S,T, A, and H) for the 12 surface positions were screened by our designalgorithm. The torsional flexibility of the amino acid side chains wasaccounted for by considering a discrete set of all allowed conformers ofeach side chain, called rotamers (Ponder, et al., (1987 ((supra);Dunbrack, et al., Struc. Biol. Vol. 1(5):334-340 (1994)). Optimizing the12 b, c, and f positions each with 10 possible amino acids results in10¹² possible sequences which corresponds to ˜10²⁸ rotamer sequenceswhen using the Dunbrack and Karplus backbone-dependent rotamer library.The immense search problem presented by rotamer sequence optimization isovercome by application of the Dead-End Elimination (DEE) theorem(Desmet, et al., (1992 ((supra); Desmet, et al., (1994) (supra);Goldstein, (1994) (supra)). Our implementation of the DEE theoremextends its utility to sequence design and rapidly finds the globallyoptimal sequence in its optimal conformation.

We examined three potential-energy functions for their effectiveness inscoring surface sequences. Each candidate scoring function was used todesign the b, c, and f positions of the model coiled coil and theresulting peptide was synthesized and characterized to assess designperformance. A hydrogen-bond potential was used to check if predictedhydrogen bonds can contribute to designed protein stability, as expectedfrom studies of hydrogen bonding in proteins and peptides (Stickle, etal., supra; Huyghues-Despointes, et al., supra). Optimizing sequencesfor hydrogen bonding, however, often buries polar protons that are notinvolved in hydrogen bonds. This uncompensated loss of potentialhydrogen-bond donors to water prompted examination of a second scoringscheme consisting of a hydrogen-bond potential in conjunction with apenalty for burial of polar protons (Eisenberg, (1986) (supra)). Wetested a third scoring scheme which augments the hydrogen bond potentialwith the empirically derived helix propensities of Baldwin and coworkers(Chakrabartty, et al., supra). Although the physical basis of helixpropensities is unclear, they can have a significant effect on proteinstability and can potentially be used to improve protein designs (O'Neil& DeGrado, 1990; Zhang, et al., Biochem. 30:2012 (1991); Blaber, et al.,Science 260:1637 (1993); O'shea, et al., 1993; Villegas, et al., Foldingand Design 1:29 (1996)). A van der Waals potential was used in all casesto account for packing interactions and excluded volume.

Several other sequences for the b, c and f positions were alsosynthesized and characterized to help discern the relative importance ofthe hydrogen-bonding and helix-propensity potentials. The sequencedesigned with the hydrogen-bond potential was randomly scrambled,thereby disrupting the designed interactions but not changing the helixpropensity of the sequence. Also, the sequence with the maximum possiblehelix propensity, all positions set to alanine, was made. Finally, toserve as undesigned controls, the naturally occurring GCN4-p1 sequenceand a sequence randomly selected from the hydrophilic amino acid setwere synthesized and studied.

Sequence design: Scoring functions and DEE: The protein structure wasmodeled on the backbone coordinates of GCN4-p1, PDB record 2zta(Bernstein, et al., supra; O'Shea, et al., supra). Atoms of all sidechains not optimized were left in their crystallographically determinedpositions. The program BIOGRAF (Molecular Simulations Incorporated, SanDiego, Calif.) was used to generate explicit hydrogens on the structurewhich was then conjugate gradient minimized for 50 steps using theDREIDING forcefield (Mayo, et al., 1990, supra). The symmetry of thedimer and lack of interactions of surface residues between the subunitsallowed independent design of each subunit. All computations were doneusing the first monomer to appear in 2zta (chain A). Abackbone-dependent rotamer library was used (Dunbrack, et al., (1993)(supra)). c₃ angles that were undetermined from the database statisticswere assigned the following values: Arg, −60°, 60°, and 180°; Gln,−120°, −60°, 0°, 60°, 120°, and 180°; Glu, 0°, 60°, and 120°; Lys, −60°,60°, and 180°. c₄ angles that were undetermined from the databasestatistics were assigned the following values: Arg, −120°, −60°, 60°,120°, and 180°; Lys, −60°, 60°, and 180°. Rotamers with combinations ofc₃ and c₄ that resulted in sequential g⁺/g⁻ or g⁻/g⁺ angles wereeliminated. Uncharged His rotamers were used. A Lennard-Jones 12-6potential with van der Waals radii scaled by 0.9 (Dahiyat, et al., Firstfully automatic design of a protein achieved by Caltech scientists, newpress release (1997) was used for van der Waals interactions. Thehydrogen bond potential consisted of a distance-dependent term and anangle-dependent term, as depicted in Equation 9, above. This hydrogenbond potential is based on the potential used in DREIDING, with morerestrictive angle-dependent terms to limit the occurrence of unfavorablehydrogen bond geometries. The angle term varies depending on thehybridization state of the donor and acceptor, as shown in Equations 10to 13, above.

In Equations 10-13, θ is the donor-hydrogen-acceptor angle, Φ is thehydrogen-acceptor-base angle (the base is the atom attached to theacceptor, for example the carbonyl carbon is the base for a carbonyloxygen acceptor), and φ is the angle between the normals of the planesdefined by the six atoms attached to the sp² centers (the supplement ofφ is used when φ is less than 90°). The hydrogen-bond function is onlyevaluated when 2.6 Å<R<3.2 Å, Φ>90°, f−109.5°<90° for the sp³ donor-sp³acceptor case, and, Φ>90° for the sp³ donor-sp² acceptor case; noswitching functions were used. Template donors and acceptors that wereinvolved in template-template hydrogen bonds were not included in thedonor and acceptor lists. For the purpose of exclusion, atemplate-template hydrogen bond was considered to exist when 2.5 Å≧R≧3.3Å and θ≧135°. A penalty of 2 kcal/mol for polar hydrogen burial, whenused, was only applied to buried polar hydrogens not involved inhydrogen bonds, where a hydrogen bond was considered to exist whenE_(HB) was less than −2 kcal/mol. This penalty was not applied totemplate hydrogens. The hydrogen-bond potential was also supplementedwith a weak coulombic term that included a distance-dependent dielectricconstant of 40R, where R is the interatomic distance. Partial atomiccharges were only applied to polar functional groups. A net formalcharge of +1 was used for Arg and Lys and a net formal charge of −1 wasused for Asp and Glu. Energies associated with α-helical propensitieswere calculated using equation 14, above. In Equation 14, E_(α) is theenergy of α-helical propensity, ΔG°_(aa) is the standard free energy ofhelix propagation of the amino acid, and ΔG°_(ala) is the standard freeenergy of helix propagation of alanine used as a standard, and N_(ss) isthe propensity scale factor which was set to 3.0. This potential wasselected in order to scale the propensity energies to a similar range asthe other terms in the scoring function. The DEE optimization followedthe methods of our previous work (Dahiyat, et al., (1996) (supra)).Calculations were performed on either a 12 processor, R10000-basedSilicon Graphics Power Challenge or a 512 node Intel Delta.

Peptide synthesis and purification and CD analysis was as in Example 1.NMR samples were prepared in 90/10H₂O/D₂O and 50 mM sodium phosphatebuffer at pH 7.0. Spectra were acquired on a Varian Unityplus 600 MHzspectrometer at 25° C. 32 transients were acquired with 1.5 seconds ofsolvent presaturation used for water suppression. Samples were ˜1 mM.Size exclusion chromatography was performed with a PolyLC hydroxyethyl Acolumn (20 cm×9 mm) at pH 7.0 in 50 mM phosphate and 150 mM NaCl at 0°C. GCN4-p1 and p-LI (Harbury, et al., supra) were used as size standardsfor dimer and tetramer, respectively. 5 μl injections of ˜1 mM peptidesolution were chromatographed at 0.50 ml/min and monitored at 214 nm.Samples were run in triplicate.

The surface sequences of all of the peptides examined in this study areshown in Table 4. TABLE 4 Sequences and properties of the synthesizedpeptides Surface Sequence ΣΔG° N Peptide Design method bcf bcf bcf bcfT_(m) (° C.) (kcal/mol) GCN4-p1 (SEQ ID NO:31) none KQD EES YHN ARK 573.831 2 6A (SEQ ID NO:32) HB EKD RER RRE RRE 71 2.193 2 6B (SEQ ID NO33) HB + PB EKQ KER ERE ERQ 72 2.868 2 6C (SEQ ID NO:34) HB + HP ARA AAARRR ARA 69 −2.041 2 6D (SEQ ID NO:35) scrambled HB REE RRR EDR KRE 712.193 2 6E (SEQ ID NO:36) random polar NTR AKS ANH NTQ 15 4.954 2 6F(SEQ ID NO:37) poly(Ala) AAA AAA AAA AAA 73 −3.096 4For clarity only the designed surface residues are shown and they aregrouped by position (b, c, and f).The sequence numbers of the designed positions are: 3, 4, 7, 10, 11, 14,17, 18, 21, 24, 25, and 28.Melting temperatures (T_(m)'s) were determined by circular dichroism andoligomerization states (N) were determined by size exclusionchromatography.ΣΔG° is the sum of the standard free energy of helix propagation of the12 b, c, and f positions (Chakrabartty, et al., 1994).Abbreviations for design methods are: hydrogen bonds (HB), polarhydrogen burial penalty (PB), and helix propensity (HP).

Sequence 6A (SEQ ID NO:32), designed with a hydrogen-bond potential, hasa preponderance of Arg and Glu residues that are predicted to formnumerous hydrogen bonds to each other. These long chain amino acids arefavored because they can extend across turns of the helix to interactwith each other and with the backbone. When the optimal geometry of thescrambled 6A sequence, 6D (SEQ ID NO:35), was found with DEE, far fewerhydrogen bonding interactions were present and its score was much worsethan 6A's (SEQ ID NO:32). 6B (SEQ ID NO:33), designed with a polarhydrogen burial penalty in addition to a hydrogen-bond potential, isstill dominated by long residues such as Lys, Glu and Gin but has fewerArg. Because Arg has more polar hydrogens than the other amino acids, itmore often buries nonhydrogen-bonded protons and therefore is disfavoredwhen using this potential function. 6C (SEQ ID NO:34) was designed witha hydrogen-bond potential and helix propensity in the scoring functionand consists entirely of Ala and Arg residues, the amino acids with thehighest helix propensities (Chakrabartty, et al., supra). The Argresidues form hydrogen bonds with Glu residues at nearby e and gpositions. The random hydrophilic sequence, 6E (SEQ ID NO:36), possessesno hydrogen bonds and scores very poorly with all of the potentialfunctions used.

The secondary structures and thermal stabilities of the peptides wereassessed by circular dichroism (CD) spectroscopy. The CD spectra of thepeptides at 1° C. and 40 μM are characteristic of a helices, with minimaat 208 and 222 nm, except for the random surface sequence peptide 6E(SEQ ID NO:36). 6E has a spectrum suggestive of a mixture of α helix andrandom coil with a [θ]₂₂₂ of −12000 deg cm²/dmol, while all the otherpeptides are greater than 90% helical with [θ]₂₂₂ of less than −30000deg cm²/dmol. The melting temperatures (T_(m)'s) of the designedpeptides are 12-16° C. higher than the T_(m) of GCN4-p1 (SEQ ID NO:31),with the exception of 6E (SEQ ID NO:36) which has a T_(m) of 15° C. CDspectra taken before and after melts were identical indicatingreversible thermal denaturation. The redesign of surface positions ofthis coiled coil produces structures that are much more stable thanwildtype GCN4-p1 (SEQ ID NO:31), while a random hydrophilic sequencelargely disrupts the peptide's stability.

Size exclusion chromatography (SEC) showed that all the peptides weredimers except for 6F (SEQ ID NO:37), the all Ala surface sequence, whichmigrated as a tetramer. These data show that surface redesign did notchange the tertiary structure of these peptides, in contrast to somecore redesigns (Harbury, et al., supra). In addition, nuclear magneticresonance (NMR) spectra of the peptides at ˜1 mM showed chemical shiftdispersion similar to GCN4-p1 (SEQ ID NO:31) (data not shown).

Peptide 6A (SEQ ID NO:32), designed with a hydrogen-bond potential,melts at 71° C. versus 57° C. for GCN4-p1 (SEQ ID NO:31), demonstratingthat rational design of surface residues can produce structures that aremarkedly more stable than naturally occurring coiled coils. This gain instability is probably not due to improved hydrogen bonding since 6D (SEQID NO:35), which has the same surface amino acid composition as 6A (SEQID NO:32) but a scrambled sequence and no predicted hydrogen bonds, alsomelts at 71° C. Further, 6B (SEQ ID NO:33) was designed with a differentscoring function and has a different sequence and set of predictedhydrogen bonds but a very similar T_(m) of 72° C.

An alternative explanation for the increased stability of thesesequences relative to GCN4-p1 (SEQ ID NO:31) is their higher helixpropensity. The long polar residues selected by the hydrogen bondpotential, Lys, Glu, Arg and Gln, are also among the best helix formers(Chakrabartty, et al., supra). Since the effect of helix propensity isnot as dependent on sequence position as that of hydrogen bonding,especially far from the helix ends, little effect would be expected fromscrambling the sequence of 6A (SEQ ID NO:32). A rough measure of thehelix propensity of the surface sequences, the sum of the standard freeenergies of helix propagation (ΣΔG°) (Chakrabartty, et al., supra),corresponds to the peptides' thermal stabilities (Table 4). Though ΣΔG°matches the trend in peptide stability, it is not quantitativelycorrelated to the increased stability of these coiled coils.

Peptide 6C (SEQ ID NO:34) was designed with helix propensity as part ofthe scoring function and it has a ΣΔG° of −2.041 kcal/mol. Though 6C(SEQ ID NO:34) is more stable than GCN4-p1 (SEQ ID NO:31), its T_(m) of69° C. is slightly lower than 6A (SEQ ID NO:32) and 6B (SEQ ID NO:33),in spite of 6C's (SEQ ID NO: 34) higher helix propensity. Similarly, 6F(SEQ ID NO:37) has the highest helix propensity possible with an all Alasequence and a ΣΔG° of −3.096 kcal/mol, but its T_(m) of 73° C. is onlymarginally higher than that of 6A (SEQ ID NO:32) or 6B (SEQ ID NO:33).6F (SEQ ID NO:37) also migrates as a tetramer during SEC, not a dimer,likely because its poly(Ala) surface exposes a large hydrophobic patchthat could mediate association. Though the results for 6C (SEQ ID NO:34)and 6F (SEQ ID NO:37) support the conclusion that helix propensity isimportant for surface design, they point out possible limitations inusing propensity exclusively. Increasing propensity does not necessarilyconfer the greatest stability on a structure, perhaps because otherfactors are being effected unfavorably. Also, as is evident from 6F (SEQID NO:37), changes in the tertiary structure of the protein can occur.

The characterization of these peptides clearly shows that surfaceresidues have a dramatic impact on the stability of α-helical coiledcoils. The wide range of stabilities displayed by the different surfacedesigns is notable, with greater than a 50° C. spread between the randomhydrophilic sequence (T_(m) 15° C.) and the designed sequences (T_(m)69-72° C.). This result is consistent with studies on other proteinsthat demonstrated the importance of solvent exposed residues (O'Neil &DeGrado, 1990; Zhang, et al., 1991; Minor, et al., (1994) (supra);Smith, et al., Science 270:980-982 (1995)). Further, these designs havesignificantly higher T_(m)'s than the wildtype GCN4-p1 (SEQ ID NO:31)sequence, demonstrating that surface residues can be used to improvestability in protein design (O'shea, et al., supra). Though helixpropensity appears to be more important than hydrogen bonding instabilizing the designed coiled coils, hydrogen bonding could beimportant in the design and stabilization of other types of secondarystructure.

Example 3

Design of a Protein Containing Core, Surface and Boundary Residues Usingvan der Waals, H-Bonding, Secondary Structure and Solvation ScoringFunctions

In this example, core, boundary and surface residue work was combined.In selecting a motif to test the integration of our designmethodologies, we sought a protein fold that would be small enough to beboth computationally and experimentally tractable, yet large enough toform an independently folded structure in the absence of disulfide bondsor metal binding sites. We chose the ββα motif typified by the zincfinger DNA binding module (Pavletich, et al. (1991) (supra)). Though itconsists of less than 30 residues, this motif contains sheet, helix, andturn structures. Further, recent work by Imperiali and coworkers whodesigned a 23 residue peptide, containing an unusual amino acid(D-proline) and a non-natural amino acid(3-(1,10-phenanthrol-2-yl)-L-alanine), that takes this structure hasdemonstrated the ability of this fold to form in the absence of metalions (Struthers, et al., 1996a). The Brookhaven Protein Data Bank (PDB)(Bernstein, et al., 1977) was examined for high resolution structures ofthe ββα motif, and the second zinc finger module of the DNA bindingprotein Zif268 (SEQ ID NO:1) (PDB code 1zaa) was selected as our designtemplate (Pavletich, et al. (1991) (supra)). The backbone of the secondmodule aligns very closely with the other two zinc fingers in Zif268(SEQ ID NO:1) and with zinc fingers in other proteins and is thereforerepresentative of this fold class. 28 residues were taken from thecrystal structure starting at lysine 33 in the numbering of PDB entry1zaa which corresponds to our position 1. The first 12 residues comprisethe β sheet with a tight turn at the 6^(th) and 7^(th) positions. Tworesidues connect the sheet to the helix, which extends through position26 and is capped by the last two residues.

In order to assign the residue positions in the template structure intocore, surface or boundary classes, the extent of side-chain burial inZif268 (SEQ ID NO:1) and the direction of the Cα-Cβ vectors wereexamined. The small size of this motif limits to one (position 5) thenumber of residues that can be assigned unambiguously to the core whilesix residues (positions 3, 12, 18, 21, 22, and 25) were classified asboundary. Three of these residues are from the sheet (positions 3, 5,and 12) and four are from the helix (positions 18, 21, 22, and 25). Oneof the zinc binding residues of Zif268 (SEQ ID NO:1) is in the core andtwo are in the boundary, but the fourth, position 8, has a Cα-Cβ vectordirected away from the protein's geometric center and is thereforeclassified as a surface position. The other surface positions consideredby the design algorithm are 4, 9, and 11 from the sheet, 15, 16, 17, 19,20, and 23 from the helix and 14, 27, and 28 which cap the helix ends.The remaining exposed positions, which either were in turns, hadirregular backbone dihedrals or were partially buried, were not includedin the sequence selection for this initial study. As in our previousstudies, the amino acids considered at the core positions duringsequence selection were A, V, L, I, F, Y, and W; the amino acidsconsidered at the surface positions were A, S, T, H, D, N, E, Q, K, andR; and the combined core and surface amino acid sets (16 amino acids)were considered at the boundary positions.

In total, 20 out of 28 positions of the template were optimized duringsequence selection. The algorithm first selects Gly for all positionswith φ angles greater than 0° in order to minimize backbone strain(residues 9 and 27). The 18 remaining residues were split into two setsand optimized separately to speed the calculation. One set contained the1 core, the 6 boundary positions and position 8 which resulted in1.2×10⁹ possible amino acid sequences corresponding to 4.3×10¹⁹ rotamersequences. The other set contained the remaining 10 surface residueswhich had 10¹⁰ possible amino acid sequences and 4.1×10²³ rotamersequences. The two groups do not interact strongly with each othermaking their sequence optimizations mutually independent, though thereare strong interactions within each group. Each optimization was carriedout with the non-optimized positions in the template set to thecrystallographic coordinates.

The optimal sequences found from the two calculations were combined andare shown in FIG. 8 (SEQ ID NO:1 and SEQ ID NO:2) aligned with thesequence from the second zinc finger of Zif268 (SEQ ID NO:1). Eventhough all of the hydrophilic amino acids were considered at each of theboundary positions, only nonpolar amino acids were selected. Thecalculated seven core and boundary positions form a well-packed buriedcluster. The Phe side chains selected by the algorithm at the zincbinding His positions, 21 and 25, are 80% buried and the Ala at 5 is100% buried while the Lys at 8 is greater than 60% exposed to solvent.The other boundary positions demonstrate the strong steric constraintson buried residues by packing similar side chains in an arrangementsimilar to Zif268. The calculated optimal configuration buried ˜830 Å²of nonpolar surface area, with Phe 12 (96% buried) and Leu 18 (88%buried) anchoring the cluster. On the helix surface, the algorithmpositions Asn 14 as a helix N-cap with a hydrogen bond between itsside-chain carbonyl oxygen and the backbone amide proton of residue 16.The six charged residues on the helix form three pairs of hydrogenbonds, though in our coiled coil designs helical surface hydrogen bondsappeared to be less important than the overall helix propensity of thesequence. Positions 4 and 11 on the exposed sheet surface were selectedto be Thr, one of the best β-sheet forming residues (Kim & Berg, 1993;Minor, et al., (1994) (supra); Smith, et al., (1995) (supra)).

Combining the 20 designed positions with the Zif268 (SEQ ID NO:1) aminoacids at the remaining 8 sites results in a peptide with overall 39%(11/28) homology to Zif268 (SEQ ID NO:1), which reduces to 15% (3/20)homology when only the designed positions are considered. A BLAST(Altschul, et al., 1990) search of the non-redundant protein sequencedatabase of the National Center for Biotechnology Information finds weakhomology, less than 40%, to several zinc finger proteins and fragmentsof other unrelated proteins. None of the alignments had significancevalues less than 0.26. By objectively selecting 20 out of 28 residues onthe Zif268 (SEQ ID NO:1) template, a peptide with little homology toknown proteins and no zinc binding site was designed.

Experimental characterization: The far UV circular dichroism (CD)spectrum of the designed molecule, pda8d (SEQ ID NO:2), shows a maximumat 195 nm and minima at 218 nm and 208 nm, which is indicative of afolded structure. The thermal melt is weakly cooperative, with aninflection point at 39° C., and is completely reversible. The broad meltis consistent with a low enthalpy of folding which is expected for amotif with a small hydrophobic core. This behavior contrasts theuncooperative transitions observed for other short peptides (Weiss &Keutmann, 1990; Scholtz, et al., PNAS USA 88:2854 (1991); Struthers, etal., J. Am. Chem. Soc. 118:3073 (1996b)).

Sedimentation equilibrium studies at 100 μM and both 7° C. and 25° C.give a molecular mass of 3490, in good agreement with the calculatedmass of 3362, indicating the peptide is monomeric. At concentrationsgreater than 500 μM, however, the data do not fit well to an idealsingle species model. When the data were fit to a monomer-dimer-tetramermodel, dissociation constants of 0.5-1.5 mM for monomer-to-dimer andgreater than 4 mM for dimer-to-tetramer were found, though theinteraction was too weak to accurately measure these values. Diffusioncoefficient measurements using the water-sLED pulse sequence (Altieri,et al., 1995) agreed with the sedimentation results: at 100 μM pda8d hasa diffusion coefficient close to that of a monomeric zinc fingercontrol, while at 1.5 mM the diffusion coefficient is similar to that ofprotein G β1, a 56 residue protein. The CD spectrum of pda8d (SEQ IDNO:2) is concentration independent from 10 μM to 2.6 mM. NMR COSYspectra taken at 2.1 mM and 100 μM were almost identical with 5 of theHα-HN crosspeaks shifted no more than 0.1 ppm and the rest of thecrosspeaks remaining unchanged. These data indicate that pda8d undergoesa weak association at high concentration, but this association hasessentially no effect on the peptide's structure.

The NMR chemical shifts of pda8d are well dispersed, suggesting that theprotein is folded and well-ordered. The Hα-HN fingerprint region of theTOCSY spectrum is well-resolved with no overlapping resonances (FIG.(9A) and all of the Hα and HN resonances have been assigned. NMR datawere collected on a Varian Unityplus 600 MHz spectrometer equipped witha Nalorac inverse probe with a self-shielded z-gradient. NMR sampleswere prepared in 90/10H₂O/D₂O or 99.9% D₂O with 50 mM sodium phosphateat pH 5.0. Sample pH was adjusted using a glass electrode with nocorrection for the effect of D₂O on measured pH. All spectra forassignments were collected at 7° C. Sample concentration wasapproximately 2 mM. NMR assignments were based on standard homonuclearmethods using DQF-COSY, NOESY and TOCSY spectra (Wuthrich, NMR ofProteins and Nucleic Acids (John Wiley & Sons, New York, 1986). NOESYand TOCSY spectra were acquired with 2K points in F2 and 512 incrementsin F1 and DQF-COSY spectra were acquired with 4K points in F2 and 1024increments in F1. All spectra were acquired with a spectral width of7500 Hz and 32 transients. NOESY spectra were recorded with mixing timesof 100 and 200 ms and TOCSY spectra were recorded with an isotropicmixing time of 80 ms. In TOCSY and DQF-COSY spectra water suppressionwas achieved by presaturation during a relaxation delay of 1.5 and 2.0s, respectively. Water suppression in the NOESY spectra was accomplishedwith the WATERGATE pulse sequence (Piotto, et al., 1992). Chemicalshifts were referenced to the HOD resonance. Spectra were zero-filled inboth F2 and F1 and apodized with a shifted gaussian in F2 and a cosinebell in F1 (NOESY and TOCSY) or a 30° shifted sine bell in F2 and ashifted gaussian in F1 (DQF-COSY).

Water-sLED experiments (Altieri, et al., 1995) were run at 25° C. at 1.5mM, 400 μM and 100 μM in 99.9% D₂O with 50 mM sodium phosphate at pH5.0. Axial gradient field strength was varied from 3.26 to 53.1 G/cm anda diffusion time of 50 ms was used. Spectra were processed with 2 Hzline broadening and integrals of the aromatic and high field aliphaticprotons were calculated and fit to an equation relating resonanceamplitude to gradient strength in order to extract diffusioncoefficients (Altieri, et al., 1995). Diffusion coefficients were1.48×10⁻⁷, 1.62×10⁻⁷ and 1.73×10⁻⁷ cm²/s at 1.5 mM, 400 μM and 100 μM,respectively. The diffusion coefficient for the zinc finger monomercontrol was 1.72×10⁻⁷ cm²/s and for protein G b1 was 1.49×10⁻⁷ cm²/s.

All unambiguous sequential and medium-range NOEs are shown in FIG. 9A.Hα-HN and/or HN-HN NOEs were found for all pairs of residues exceptR6-17 and K16-E17, both of which have degenerate HN chemical shifts, andP2-Y3 which have degenerate Hα chemical shifts. An NOE is present,however, from a P2Hδ to the Y3 HN analogous to sequential HN-HNconnections. Also, strong K1 Hα to P2Hδ NOEs are present and allowedcompletion of the resonance assignments.

The structure of pda8d (SEQ ID NO:2) as determined using 354 NOErestraints (12.6 restraints per residue) that were non-redundant withcovalent structure. An ensemble of 32 structures (data not shown) wasobtained using X-PLOR (Brunger, 1992) with standard protocols for hybriddistance geometry-simulated annealing. The structures in the ensemblehad good covalent geometry and no NOE restraint violations greater than0.3 Å. As shown in Table 5, the backbone was well defined with aroot-mean-square (rms) deviation from the mean of 0.55 Å when thedisordered termini (residues 1, 2, 27, and 28) were excluded. The rmsdeviation for the backbone (3-26) plus the buried side chains (residues3, 5, 7, 12, 18, 21, 22, and 25) was 1.05 Å.

Table 5. NMR structure determination of pda8d (SEQ ID NO:2): distancerestraints, structural statistics, atomic root-mean-square (rms)deviations, and comparison to the design target. <SA> are the 32simulated annealing structures, SA is the average structure and SD isthe standard deviation. The design target is the backbone of Zif268 (SEQID NO:1). Distance restraints Intraresidue 148 Sequential 94 Short range(|i-j| = 2-5 residues) 78 Long range (|i-j| > 5 residues) 34 Total 354Structural statistics <SA> ± SD Rms deviation from distance restraints(Å) 0.049 ± .004  Rms deviation from idealized geometry (Å) Bonds (Å)0.0051 ± 0.0004 Angles (degrees) 0.76 ± 0.04 Impropers (degrees) 0.56 ±0.04 Atomic rms deviations (Å)* <SA> vs. SA ± SD Backbone 0.55 ± 0.03Backbone + nonpolar side chains 1.05 ± 0.06 Heavy atoms 1.25 ± 0.04Atomic rms deviations between pda8d and the design target (Å)* SA vs.target Backbone 1.04 Heavy atoms 2.15*Atomic rms deviations are for residues 3 to 26, inclusive. The termini,residues 1, 2, 27, and 28, were highly disordered and had very fewnon-sequential or non-intraresidue contacts.

The NMR solution structure of pda8d (SEQ ID NO:2) shows that it foldsinto a bba motif with well-defined secondary structure elements andtertiary organization which match the design target. A direct comparisonof the design template, the backbone of the second zinc finger of Zif268(SEQ ID NO:1), to the pda8d solution structure highlights theirsimilarity (data not shown). Alignment of the pda8d (SEQ ID NO:2)backbone to the design target is excellent, with an atomic rms deviationof 1.04 Å (Table 5). Pda8d (SEQ ID NO:2) and the design targetcorrespond throughout their entire structures, including the turnsconnecting the secondary structure elements.

In conclusion, the experimental characterization of pda8d (SEQ ID NO:2)shows that it is folded and well-ordered with a weakly cooperativethermal transition, and that its structure is an excellent match to thedesign target. To our knowledge, pda8d (SEQ ID NO:2) is the shortestsequence of naturally occurring amino acids that folds to a uniquestructure without metal binding, oligomerization or disulfide bondformation (McKnight, et al., Nature Struc. Biol. 4:180 (1996)). Thesuccessful design of pda8d (SEQ ID NO:2) supports the use of objective,quantitative sequence selection algorithms for protein design. Thisrobustness suggests that the program can be used to design sequences forde novo backbones.

Example 4

Protein Design Using a Scaled van der Waals Scoring Function in the CoreRegion

An ideal model system to study core packing is the β1immunoglobulin-binding domain of streptococcal protein G (Gβ1) (SEQ IDNO:38) (Gronenborn, et al., Science 253:657 (1991); Alexander, et al.,Biochem. 31: 3597 (1992); Barchi, et al., Protein Sci. 3:15 (1994);Gallagher, et al., 1994; Kuszewski, et al., 1994; Orban, et al., 1995).Its small size, 56 residues, renders computations and experimentstractable. Perhaps most critical for a core packing study, Gβ1 (SEQ IDNO:38) contains no disulfide bonds and does not require a cofactor ormetal ion to fold. Further, Gβ1 (SEQ ID NO:38) contains sheet, helix andturn structures and is without the repetitive side-chain packingpatterns found in coiled coils or some helical bundles. This lack ofperiodicity reduces the bias from a particular secondary or tertiarystructure and necessitates the use of an objective side-chain selectionprogram to examine packing effects.

Sequence positions that constitute the core were chosen by examining theside-chain solvent accessible surface area of Gβ1 (SEQ ID NO:38). Anyside chain exposing less than 10% of its surface was considered buried.Eleven residues meet this criteria, with seven from the β sheet(positions 3, 5, 7, 20, 43, 52 and 54), three from the helix (positions26, 30, and 34) and one in an irregular secondary structure (position39). These positions form a contiguous core. The remainder of theprotein structure, including all other side chains and the backbone, wasused as the template for sequence selection calculations at the elevencore positions.

All possible core sequences consisting of alanine, valine, leucine,isoleucine, phenylalanine, tyrosine or tryptophan (A, V, L, I, F, Y orW) were considered. Our rotamer library was similar to that used byDesmet and coworkers (Desmet, et al., (1992) (supra)). Optimizing thesequence of the core of Gβ1 (SEQ ID NO:38) with 217 possible hydrophobicrotamers at all 11 positions results in 217¹¹, or 5×10²⁵, rotamersequences. Our scoring function consisted of two components: a van derWaals energy term and an atomic solvation term favoring burial ofhydrophobic surface area. The van der Waals radii of all atoms in thesimulation were scaled by a factor α (Eqn. 3) to change the importanceof packing effects. Radii were not scaled for the buried surface areacalculations. By predicting core sequences with various radii scalingsand then experimentally characterizing the resulting proteins, arigorous study of the importance of packing effects on protein design ispossible.

The protein structure was modeled on the backbone coordinates of Gβ1(SEQ ID NO:38), PDB record 1pga (Bernstein, et al., supra; Gallagher, etal., 1994). Atoms of all side chains not optimized were left in theircrystallographically determined positions. The program BIOGRAF(Molecular Simulations Incorporated, San Diego, Calif.) was used togenerate explicit hydrogens on the structure which was then conjugategradient minimized for 50 steps using the Dreiding forcefield (Mayo, etal., 1990, supra). The rotamer library, DEE optimization and Monte Carlosearch was as outlined above. A Lennard-Jones 12-6 potential was usedfor van der Waals interactions, with atomic radii scaled for the variouscases as discussed herein. The Richards definition of solvent-accessiblesurface area (Lee & Richards, supra) was used and areas were calculatedwith the Connolly algorithm (Connolly, (1983) (supra)). An atomicsalvation parameter, derived from our previous work, of 23 cal/mol/Å²was used to favor hydrophobic burial and to penalize solvent exposure.To calculate side-chain nonpolar exposure in our optimization framework,we first consider the total hydrophobic area exposed by a rotamer inisolation. This exposure is decreased by the area buried inrotamer/template contacts, and the sum of the areas buried in pairwiserotamer/rotamer contacts.

Global optimum sequences for various values of the radius scaling factorα were found using the Dead-End Elimination theorem (Table 6) (SEQ IDNO:38 to SEQ ID NO:48). Optimal sequences, and their correspondingproteins, are named by the radius scale factor used in their design. Forexample, the sequence designed with a radius scale factor of α=0.90 iscalled α90 (SEQ ID NO:43). TABLE 6 Gβ1 sequence TYR LEU LEU ALA ALA PHEALA VAL TRP PHE VAL α vol 3 5 7 20 26 30 34 39 43 52 54 0.70 (SEQ IDNO:39) 1.28 TRP TYR ILE ILE PHE TRP LEU ILE PHE LEU ILE 0.75 (SEQ IDNO:40) 1.23 PHE ILE PHE ILE VAL TRP VAL LEU | | ILE 0.80 (SEQ ID NO:41)1.13 PHE | ILE | | | ILE ILE | TRP ILE 0.85 (SEQ ID NO:42) 1.15 PHE |ILE | | | LEU ILE | TRP PHE 0.90 (SEQ ID NO:43) 1.01 PHE | ILE | | | |ILE | | | 0.95 (SEQ ID NO:43) 1.01 PHE | ILE | | | | ILE | | | 1.0 (SEQID NO:44) 0.99 PHE | VAL | | | | ILE | | | 1.05 (SEQ ID NO:45) 0.93 PHE| ALA | | | | | | | | 1.075 (SEQ ID NO:46) 0.83 ALA ALA ILE | | ILE | || ILE ILE 1.10 (SEQ ID NO:47) 0.77 ALA | ALA | | ALA | | | ILE ILE 1.15(SEQ ID NO:48) 0.68 ALA ALA ALA | | ALA | | | LEU |

In Table 6 (SEQ ID NO:38 to SEQ ID NO:48), the Gβ1 (SEQ ID NO:38)sequence and position numbers are shown at the top. vol is the fractonof core side-chain volume relative to the Gβ1 (SEQ ID NO:38) sequence. Avertical bar indicates identity with the Gβ1 (SEQ ID NO:38) sequence.

α100 (SEQ ID NO:44) was designed with α=1.0 and hence serves as abaseline for full incorporation of steric effects. The α100 (SEQ IDNO:44) sequence is very similar to the core sequence of Gβ1 (SEQ IDNO:38) (Table 6) even though no information about the naturallyoccurring sequence was used in the side-chain selection algorithm.Variation of α from 0.90 to 1.05 caused little change in the optimalsequence, demonstrating the algorithm's robustness to minor parameterperturbations. Further, the packing arrangements predicted withα=0.90-1.05 closely match Gβ1 (SEQ ID NO:38) with average χ angledifferences of only 4° from the crystal structure. The high identity andconformational similarity to Gβ1 (SEQ ID NO:38) imply that, when packingconstraints are used, backbone conformation strongly determines a singlefamily of well packed core designs. Nevertheless, the constraints oncore packing were being modulated by α as demonstrated by Monte Carlosearches for other low energy sequences. Several alternate sequences andpacking arrangements are in the twenty best sequences found by the MonteCarlo procedure when α=0.90. These alternate sequences score much worsewhen α=0.95, and when α=1.0 or 1.05 only strictly conservative packinggeometries have low energies. Therefore, α=1.05 and α=0.90 define thehigh and low ends, respectively, of a range where packing specificitydominates sequence design.

For α<0.90, the role of packing is reduced enough to let the hydrophobicsurface potential begin to dominate, thereby increasing the size of theresidues selected for the core (Table 6) (SEQ ID NO:38 to SEQ ID NO:48).A significant change in the optimal sequence appears between α=0.90 and0.85 with both α85 (SEQ ID NO:42) and α80 (SEQ ID NO:41) containingthree additional mutations relative to α90 (SEQ ID NO:43). Also, α85(SEQ ID NO:42) and α80 (SEQ ID NO:41) have a 15% increase in totalside-chain volume relative to Gβ1 (SEQ ID NO:38). As a drops below 0.80an additional 10% increase in side-chain volume and numerous mutationsoccur, showing that packing constraints have been overwhelmed by thedrive to bury nonpolar surface. Though the jumps in volume and shifts inpacking arrangement appear to occur suddenly for the optimal sequences,examination of the suboptimal low energy sequences by Monte Carlosampling demonstrates that the changes are not abrupt. For example, theα85 optimal sequence (SEQ ID NO:42) is the 11^(th) best sequence whenα=0.90, and similarly, the α90 optimal sequence (SEQ ID NO:43) is the9^(th) best sequence when α=0.85.

For a>1.05 atomic van der Waals repulsions are so severe that most aminoacids cannot find any allowed packing arrangements, resulting in theselection of alanine for many positions. This stringency is likely anartifact of the large atomic radii and does not reflect increasedpacking specificity accurately. Rather, a=1.05 is the upper limit forthe usable range of van der Waals scales within our modeling framework.

Experimental characterization of core designs. Variation of the van derWaals scale factor a results in four regimes of packing specificity:regime 1 where 0.9≦α≦1.05 and packing constraints dominate the sequenceselection; regime 2 where 0.8≦α≦0.9 and the hydrophobic solvationpotential begins to compete with packing forces; regime 3 where α<0.8and hydrophobic salvation dominates the design; and, regime 4 whereα>1.05 and van der Waals repulsions appear to be too severe to allowmeaningful sequence selection. Sequences that are optimal designs wereselected from each of the regimes for synthesis and characterization.They are α90 (SEQ ID NO:43) from regime 1, α85 (SEQ ID NO: 42) fromregime 2, α70 (SEQ ID NO:39) from regime 3 and α107 (SEQ ID NO:46) fromregime 4. For each of these sequences, the calculated amino acididentities of the eleven core positions are shown in Table 6; theremainder of the protein sequence matches Gβ1 (SEQ ID NO:38). The goalwas to study the relation between the degree of packing specificity usedin the core design and the extent of native-like character in theresulting proteins.

Peptide synthesis and purification. With the exception of the elevencore positions designed by the sequence selection algorithm, thesequences synthesized match Protein Data Bank entry 1pga. Peptides weresynthesized using standard Fmoc chemistry, and were purified byreverse-phase HPLC. Matrix assisted laser desorption mass spectrometryfound all molecular weights to be within one unit of the expectedmasses.

CD and fluorescence spectroscopy and size exclusion chromatography. Thesolution conditions for all experiments were 50 mM sodium phosphatebuffer at pH 5.5 and 25° C. unless noted. Circular dichroism spectrawere acquired on an Aviv 62DS spectrometer equipped with athermoelectric unit. Peptide concentration was approximately 20 μM.Thermal melts were monitored at 218 nm using 2° increments with anequilibration time of 120 s. T_(m)'s were defined as the maxima of thederivative of the melting curve. Reversibility for each of the proteinswas confirmed by comparing room temperature CD spectra from before andafter heating. Guanidinium chloride denaturation measurements followedpublished methods (Pace, Methods. Enzymol. 131:266 (1986)). Proteinconcentrations were determined by UV spectrophotometry. Fluorescenceexperiments were performed on a Hitachi F-4500 in a 1 cm pathlengthcell. Both peptide and ANS concentrations were 50 μM. The excitationwavelength was 370 nm and emission was monitored from 400 to 600 nm.Size exclusion chromatography was performed with a PolyLC hydroxyethyl Acolumn at pH 5.5 in 50 mM sodium phosphate at 0° C. Ribonuclease A,carbonic anhydrase and Gβ1 (SEQ ID NO:38) were used as molecular weightstandards. Peptide concentrations during the separation were ˜15 μM asestimated from peak heights monitored at 275 nm.

Nuclear magnetic resonance spectroscopy. Samples were prepared in90/10H₂O/D₂O and 50 mM sodium phosphate buffer at pH 5.5. Spectra wereacquired on a Varian Unityplus 600 MHz spectrometer at 25° C. Sampleswere approximately 1 mM, except for α70 (SEQ ID NO:39) which had limitedsolubility (100 μM). For hydrogen exchange studies, an NMR sample wasprepared, the pH was adjusted to 5.5 and a spectrum was acquired toserve as an unexchanged reference. This sample was lyophilized,reconstituted in D₂O and repetitive acquisition of spectra was begunimmediately at a rate of 75 s per spectrum. Data acquisition continuedfor ˜20 hours, then the sample was heated to 99° C. for three minutes tofully exchange all protons. After cooling to 25° C., a final spectrumwas acquired to serve as the fully exchanged reference. The areas of allexchangeable amide peaks were normalized by a set of non-exchangingaliphatic peaks. pH values, uncorrected for isotope effects, weremeasured for all the samples after data acquisition and the time axiswas normalized to correct for minor differences in pH (Rohl, et al.,Biochem. 31:1263 (1992)).

α90 (SEQ ID NO:43) and α85 (SEQ ID NO:42) have ellipticities and spectravery similar to Gβ1 (SEQ ID NO:38) (not shown), suggesting that theirsecondary structure content is comparable to that of Gβ1 (SEQ ID NO:38)(FIG. 10). Conversely, α70 (SEQ ID NO:39) has much weaker ellipticityand a perturbed spectrum, implying a loss of secondary structurerelative to Gβ1 (SEQ ID NO:38). α107 (SEQ ID NO:46) has a spectrumcharacteristic of a random coil. Thermal melts monitored by CD are shownin FIG. 10B. α85 (SEQ ID NO:42) and α90 (SEQ ID NO:43) both havecooperative transitions with melting temperatures (T_(m)'s) of 83° C.and 92° C., respectively. α107 (SEQ ID NO:46) shows no thermaltransition, behavior expected from a fully unfolded polypeptide, and α70(SEQ ID NO: 39) has a broad, shallow transition, centered at ˜40° C.,characteristic of partially folded structures. Relative to Gb1 (SEQ IDNO:38), which has a T_(m) of 87° C. (Alexander, et al., supra), α85 (SEQID NO:42) is slightly less thermostable and α90 (SEQ ID NO:43) is morestable. Chemical denaturation measurements of the free energy ofunfolding (ΔG_(u)) at 25° C. match the trend in T_(m)'s.

α90 (SEQ ID NO:43) has a larger ΔG_(u) than that reported for Gβ1 (SEQID NO:38) (Alexander, et al., supra) while α85 (SEQ ID NO:42) isslightly less stable. It was not possible to measure ΔG_(u) for α 70(SEQ ID NO:39) or α107 (SEQ ID NO:46) because they lack discernibletransitions.

The extent of chemical shift dispersion in the proton NMR spectrum ofeach protein was assessed to gauge each protein's degree of native-likecharacter (data not shown). α90 (SEQ ID NO:43) possesses a highlydispersed spectrum, the hallmark of a well-ordered native protein. α85(SEQ ID NO:42) has diminished chemical shift dispersion and peaks thatare somewhat broadened relative to α90 (SEQ ID NO:43), suggesting amoderately mobile structure that nevertheless maintains a distinct fold.α70's (SEQ ID NO:39) NMR spectrum has almost no dispersion. The broadpeaks are indicative of a collapsed but disordered and fluctuatingstructure. α107 (SEQ ID NO:46) has a spectrum with sharp lines and nodispersion, which is indicative of an unfolded protein.

Amide hydrogen exchange kinetics are consistent with the conclusionsreached from examination of the proton NMR spectra. Measuring theaverage number of unexchanged amide protons as a function of time foreach of the designed proteins results as follows (data not shown): α90(SEQ ID NO:43) protects ˜13 protons for over 20 hours of exchange at pH5.5 and 25° C. The α90 (SEQ ID NO: 43) exchange curve isindistinguishable from Gβ1's (SEQ ID NO:38) (not shown). α85 (SEQ IDNO:42) also maintains a well-protected set of amide protons, adistinctive feature of ordered native-like proteins. The number ofprotected protons, however, is only about half that of α90 (SEQ IDNO:43). The difference is likely due to higher flexibility in some partsof the α85 (SEQ ID NO:42) structure. In contrast, α70 (SEQ ID NO:39) andα107 (SEQ ID NO:46) were fully exchanged within the three minute deadtime of the experiment, indicating highly dynamic structures.

Near UV CD spectra and the extent of 8-anilino-1-naphthalene sulfonicacid (ANS) binding were used to assess the structural ordering of theproteins. The near UV CD spectra of α85 (SEQ ID NO:42) and α90 (SEQ IDNO:43) have strong peaks as expected for proteins with aromatic residuesfixed in a unique tertiary structure while α70 (SEQ ID NO:39) and α107(SEQ ID NO:46) have featureless spectra indicative of proteins withmobile aromatic residues, such as non-native collapsed states orunfolded proteins. α70 (SEQ ID NO:39) also binds ANS well, as indicatedby a three-fold intensity increase and blue shift of the ANS emissionspectrum. This strong binding suggests that α70 (SEQ ID NO:39) possessesa loosely packed or partially exposured cluster of hydrophobic residuesaccessible to ANS. ANS binds α85 (SEQ ID NO:42) weakly, with only a 25%increase in emission intensity, similar to the association seen for somenative proteins (Semisotnov, et al., Biopolymers 31:119 (1991)). α90(SEQ ID NO:43) and α107 (SEQ ID NO:46) cause no change in ANSfluorescence. All of the proteins migrated as monomers during sizeexclusion chromatography.

In summary, α90 (SEQ ID NO:43) is a well-packed native-like protein byall criteria, and it is more stable than the naturally occurring Gb1(SEQ ID NO:38) sequence, possibly because of increased hydrophobicsurface burial. α85 (SEQ ID NO:42) is also a stable, ordered protein,albeit with greater motional flexibility than α90 (SEQ ID NO:43), asevidenced by its NMR spectrum and hydrogen exchange behavior. α70 (SEQID NO:39) has all the features of a disordered collapsed globule: anon-cooperative thermal transition, no NMR spectral dispersion or amideproton protection, reduced secondary structure content and strong ANSbinding. α107 (SEQ ID NO:46) is a completely unfolded chain, likely dueto its lack of large hydrophobic residues to hold the core together. Theclear trend is a loss of protein ordering as α decreases below 0.90.

The different packing regimes for protein design can be evaluated inlight of the experimental data. In regime 1, with 0.9≦α≦1.05, the designis dominated by packing specificity resulting in well-ordered proteins.In regime 2, with 0.8≦α≦0.9, packing forces are weakened enough to letthe hydrophobic force drive larger residues into the core which producesa stable well-packed protein with somewhat increased structural motion.In regime 3, α<0.8, packing forces are reduced to such an extent thatthe hydrophobic force dominates, resulting in a fluctuating, partiallyfolded structure with no stable core packing. In regime 4, α>1.05, thesteric forces used to implement packing specificity are scaled too highto allow reasonable sequence selection and hence produce an unfoldedprotein. These results indicate that effective protein design requires aconsideration of packing effects. Within the context of a protein designalgorithm, we have quantitatively defined the range of packing forcesnecessary for successful designs. Also, we have demonstrated thatreduced specificity can be used to design protein cores with alternativepackings.

To take advantage of the benefits of reduced packing constraints,protein cores should be designed with the smallest α that still resultsin structurally ordered proteins. The optimal protein sequence fromregime 2, α85 (SEQ ID NO:42), is stable and well packed, suggesting0.8<α≦0.9 as a good range. NMR spectra and hydrogen exchange kinetics,however, clearly show that α85 (SEQ ID NO:42) is not as structurallyordered as α90 (SEQ ID NO:43). The packing arrangements predicted by ourprogram for W43 in α85 (SEQ ID NO:42) and α90 (SEQ ID NO:43) present apossible explanation. For α90 (SEQ ID NO:43), W43 is predicted to packin the core with the same conformation as in the crystal structure ofGβ1 (SEQ ID NO:38). In α85 (SEQ ID NO:42), the larger side chains atpositions 34 and 54, leucine and phenylalanine respectively, compared toalanine and valine in α90 (SEQ ID NO:43), force W43 to expose 91 Å² ofnonpolar surface compared to 19 Å² in α90 (SEQ ID NO:43). Thehydrophobic driving force this exposure represents seems likely tostabilize alternate conformations that bury W43 and thereby couldcontribute to α85's (SEQ ID NO:42) conformational flexibility (Dill,1985; Onuchic, et al., 1996). In contrast to the other core positions, aresidue at position 43 can be mostly exposed or mostly buried dependingon its side-chain conformation. We designate positions with thischaracteristic as boundary positions, which pose a difficult problem forprotein design because of their potential to either strongly interactwith the protein's core or with solvent.

A scoring function that penalizes the exposure of hydrophobic surfacearea might assist in the design of boundary residues. Dill and coworkersused an exposure penalty to improve protein designs in a theoreticalstudy (Sun, et al., Protein Eng. 8(12)1205-1213(1995)).

A nonpolar exposure penalty would favor packing arrangements that eitherbury large side chains in the core or replace the exposed amino acidwith a smaller or more polar one. We implemented a side-chain nonpolarexposure penalty in our optimization framework and used a penalizingsolvation parameter with the same magnitude as the hydrophobic burialparameter.

The results of adding a hydrophobic surface exposure penalty to ourscoring function are shown in Table 7 (SEQ ID NO:42, SEQ ID NO:43 andSEQ ID NO:49 through SEQ ID NO:61). TABLE 7 α=0.85 TYR LEU LEU ALA ALAPHE ALA VAL TRP PHE VAL # A_(np) 3 5 7 20 26 30 34 39 43 52 54 1 (SEQ ID109 PHE | ILE | | | LEU ILE | TRP PHE NO:42) 2 (SEQ ID 109 | | ILE | | |LEU ILE | TRP PHE NO:49) 3 (SEQ ID 104 PHE | ILE | | | LEU ILE | | PHENO:50) 4 (SEQ ID 104 | | ILE | | | LEU ILE | | PHE NO:51) 5 (SEQ ID 108PHE | ILE | | | LEU | | TRP PHE NO:52) 6 (SEQ ID 62 PHE | ILE | | | LEUILE VAL TRP PHE NO:53) 7 (SEQ ID 103 PHE | ILE | | | LEU ILE | TYR PHENO:54) 8 (SEQ ID 109 PHE | VAL | | | LEU ILE | TRP PHE NO:55) 9 (SEQ ID30 PHE | ILE | | | | ILE | | | NO:43) 10 (SEQ ID 38 PHE | ILE | | | |ILE | TRP | NO:56) 11 (SEQ ID 108 | | ILE | | | LEU | | TRP PHE NO:57)12 (SEQ ID 62 | | ILE | | | LEU ILE VAL TRP PHE NO:58) 13 (SEQ ID 109PHE | ILE | | TYR LEU ILE | TRP PHE NO:59) 14 (SEQ ID 103 | | ILE | | |LEU ILE | TYR PHE NO:60) 15 (SEQ ID 109 | | VAL | | | LEU ILE | TRP PHENO:61)

Table 7 (SEQ ID NO:42, SEQ ID NO:43 and SEQ ID NO:49 through SEQ IDNO:61) depicts the 15 best sequences for the core positions of Gβ1 (SEQID NO:38) using α=0.85 without an exposure penalty. A_(np) is theexposed nonpolar surface area in A².

When α=0.85 the nonpolar exposure penalty dramatically alters theordering of low energy sequences. The α85 (SEQ ID NO:42) sequence, theformer ground state, drops to 7^(th) and the rest of the 15 bestsequences expose far less hydrophobic area because they bury W43 in aconformation similar to α90 (SEQ ID NO:43) (model not shown). Theexceptions are the 8^(th) (SEQ ID NO:55) and 14^(th) (SEQ ID NO:60)sequences, which reduce the size of the exposed boundary residue byreplacing W43 with an isoleucine, and the 13^(th) best sequence (SEQ IDNO:59) which replaces W43 with a valine. The new ground state sequenceis very similar to α90 (SEQ ID NO:43), with a single valine toisoleucine mutation, and should share α90's (SEQ ID NO:43) stability andstructural order. In contrast, when α=0.90, the optimal sequence doesnot change and the next 14 best sequences (SEQ ID NO:60), found by MonteCarlo sampling, change very little. This minor effect is not surprising,since steric forces still dominate for α=0.90 and most of thesesequences expose very little surface area. Burying W43 restrictssequence selection in the core somewhat, but the reduced packing forcesfor α=0.85 still produce more sequence variety than α=0.90. The exposurepenalty complements the use of reduced packing specificity by limitingthe gross overpacking and solvent exposure that occurs when the core'sboundary is disrupted. Adding this constraint should allow lower packingforces to be used in protein design, resulting in a broader range ofhigh-scoring sequences and reduced bias from fixed backbone and discreterotamers.

To examine the effect of substituting a smaller residue at a boundaryposition, we synthesized and characterized the 13^(th) best sequence ofthe α=0.85 optimization with exposure penalty (Table 8) (SEQ ID NO:42,SEQ ID NO:43, SEQ ID NO:49, SEQ ID NO:53 and SEQ ID NO:62 to SEQ IDNO:72). TABLE 8 α=0.85 exposure penalty TYR LEU LEU ALA ALA PHE ALA VALTRP PHE VAL # A_(np) 3 5 7 20 26 30 34 39 43 52 54 1 (SEQ ID 30 PHE |ILE | | | | ILE | | ILE NO:62) 2 (SEQ ID 29 PHE | ILE | | | ILE ILE | || NO:63) 3 (SEQ ID 29 PHE ILE PHE | | | | ILE | | | NO:64) 4 (SEQ ID 30| | ILE | | | | ILE | | ILE NO:65) 5 (SEQ ID 29 | | ILE | | | ILE ILE || | NO:66) 6 (SEQ ID 29 | ILE PHE | | | | ILE | | | NO:67) 7 (SEQ ID 109PHE | ILE | | | LEU ILE | TRP PHE NO:42) 8 (SEQ ID 52 PHE | ILE | | |LEU ILE ILE | PHE NO:68) 9 (SEQ ID 29 | | ILE | | | | ILE | | | NO:69)10 (SEQ ID 29 PHE | ILE | | | | ILE | | | NO:43) 11 (SEQ ID 109 | | ILE| | | LEU ILE | TRP PHE NO:49) 12 (SEQ ID 38 PHE | ILE | | | | ILE | TRPILE NO:70) 13 (SEQ ID 62 PHE | ILE | | | LEU ILE VAL TRP PHE NO:53) 14(SEQ ID 52 | | ILE | | | LEU ILE ILE | PHE NO:71) 15 (SEQ ID 30 PHE |ILE | | | | ILE | TYR ILE NO:72)

Table 8 (SEQ ID NO:42, SEQ ID NO:43, SEQ ID NO:49, SEQ ID NO:53 and SEQID NO:62 to SEQ ID NO:72) depicts the 15 best sequences of the corepositions of Gβ1 (SEQ ID NO:38) using α=0.85 with an exposure penalty.A_(np) is the exposed nonpolar surface area in A².

This sequence, α85W43V, replaces W43 with a valine but is otherwiseidentical to α85 (SEQ ID NO: 42). Though the 8^(th) (SEQ ID NO:68) and14^(th) (SEQ ID NO:71) sequences also have a smaller side chain atposition 43, additional changes in their sequences relative to α85 (SEQID NO:42) would complicate interpretation of the effect of the boundaryposition change. Also, α85W43V has a significantly different packingarrangement compared to Gβ1 (SEQ ID NO:38), with 7 out of 11 positionsaltered, but only an 8% increase in side-chain volume. Hence, α85W43V isa test of the tolerance of this fold to a different, but nearly volumeconserving, core. The far UV CD spectrum of α85W43V is very similar tothat of Gβ1 (SEQ ID NO:38) with an ellipticity at 218 nm of −14000 degcm²/dmol. While the secondary structure content of α85W43V isnative-like, its T_(m) is 65° C., nearly 20° C. lower than α85 (SEQ IDNO:42). In contrast to α85W43V's decreased stability, its NMR spectrumhas greater chemical shift dispersion than α85 (SEQ ID NO:42) (data notshown). The amide hydrogen exchange kinetics show a well protected setof about four protons after 20 hours (data not shown). This fasterexchange relative to α85 (SEQ ID NO:42) is explained by α85W43V'ssignificantly lower stability (Mayo & Baldwin, 1993). α85W43V appears tohave improved structural specificity at the expense of stability, aphenomenon observed previously in coiled coils (Harbury, et al., 1993).By using an exposure penalty, the design algorithm produced a proteinwith greater native-like character.

We have quantitatively defined the role of packing specificity inprotein design and have provided practical bounds for the role of stericforces in our protein design program. This study differs from previouswork because of the use of an objective, quantitative program to varypacking forces during design, which allows us to readily apply ourconclusions to different protein systems. Further, by using the minimumeffective level of steric forces, we were able to design a wider varietyof packing arrangements that were compatible with the given fold.Finally, we have identified a difficulty in the design of side chainsthat lie at the boundary between the core and the surface of a protein,and we have implemented a nonpolar surface exposure penalty in oursequence design scoring function that addresses this problem.

Example 5

Design of a Full Protein

The entire amino acid sequence of a protein motif has been computed. Asin Example 4, the second zinc finger module of the DNA binding proteinZif268 (SEQ ID NO:1) was selected as the design template. In order toassign the residue positions in the template structure into core,surface or boundary classes, the orientation of the Cα-Cβ vectors wasassessed relative to a solvent accessible surface computed using onlythe template Cα atoms. A solvent accessible surface for only the Cαatoms of the target fold was generated using the Connolly algorithm witha probe radius of 8.0 Å, a dot density of 10 Å², and a Ca radius of 1.95Å. A residue was classified as a core position if the distance from itsCα, along its Cα-Cβ vector, to the solvent accessible surface wasgreater than 5 Å, and if the distance from its Cβ to the nearest surfacepoint was greater than 2.0 Å. The remaining residues were classified assurface positions if the sum of the distances from their Cα, along theirCα-Cβ vector, to the solvent accessible surface plus the distance fromtheir Cβ to the nearest surface point was less than 2.7 Å. All remainingresidues were classified as boundary positions. The classifications forZif268 (SEQ ID NO:1) were used as computed except that positions 1, 17and 23 were converted from the boundary to the surface class to accountfor end effects from the proximity of chain termini to these residues inthe teriary structure and inaccuracies in the assignment.

The small size of this motif limits to one (position 5) the number ofresidues that can be assigned unambiguously to the core while sevenresidues (positions 3, 7, 12, 18, 21, 22, and 25) were classified asboundary and the remaining 20 residues were assigned to the surface.Interestingly, while three of the zinc binding positions of Zif268 (SEQID NO:1) are in the boundary or core, one residue, position 8, has aCα-Cβ vector directed away from the protein's geometric center and isclassified as a surface position. As in our previous studies, the aminoacids considered at the core positions during sequence selection were A,V, L, I, F, Y, and W; the amino acids considered at the surfacepositions were A, S, T, H, D, N, E, Q, K, and R; and the combined coreand surface amino acid sets (16 amino acids) were considered at theboundary positions. Two of the residue positions (9 and 27) have φangles greater than 0° and are set to Gly by the sequence selectionalgorithm to minimize backbone strain.

The total number of amino acid sequences that must be considered by thedesign algorithm is the product of the number of possible amino acidtypes at each residue position. The ββα motif residue classificationdescribed above results in a virtual combinatorial library of 1.9×10²⁷possible amino acid sequences (one core position with 7 possible aminoacids, 7 boundary positions with 16 possible amino acids, 18 surfacepositions with 10 possible amino acids and 2 positions with φ anglesgreater than 0° each with 1 possible amino acid). A correspondingpeptide library consisting of only a single molecule for each 28 residuesequence would have a mass of 11.6 metric tons. In order to accuratelymodel the geometric specificity of side-chain placement, we explicitlyconsider the torsional flexibility of amino acid side chains in oursequence scoring by representing each amino acid with a discrete set ofallowed conformations, called rotamers. As above, a backbone dependentrotatmer library was used (Dunbrack and Karplus, supra), withadjustments in the χ₁ and χ₂ angles of hydrophobic residues. As aresult, the design algorithm must consider all rotamers for eachpossible amino acid at each residue position. The total size of thesearch space for the ββα motif is therefore 1.1×10⁶² possible rotamersequences. The rotamer optimization problem for the ββα motif required90 CPU hours to find the optimal sequence.

The optimal sequence, shown in FIG. 11, is called Full Sequence Design-1(FSD-1) (SEQ ID NO:3). Even though all of the hydrophilic amino acidswere considered at each of the boundary positions, the algorithmselected only nonpolar amino acids. The eight core and boundarypositions are predicted to form a well-packed buried cluster. The Pheside chains selected by the algorithm at the zinc binding His positionsof Zif268 (SEQ ID NO:1), positions 21 and 25, are over 80% buried andthe Ala at position 5 is 100% buried while the Lys at position 8 isgreater than 60% exposed to solvent. The other boundary positionsdemonstrate the strong steric constraints on buried residues by packingsimilar side chains in an arrangement similar to that of Zif268 (SEQ IDNO:1). The calculated optimal configuration for core and boundaryresidues buries ˜1150 Å² of nonpolar surface area. On the helix surface,the program positions Asn 14 as a helix N-cap with a hydrogen bondbetween its side-chain carbonyl oxygen and the backbone amide proton ofresidue 16. The eight charged residues on the helix form three pairs ofhydrogen bonds, though in our coiled coil designs helical surfacehydrogen bonds appeared to be less important than the overall helixpropensity of the sequence (Dahiyat, et al., Science (1997)). Positions4 and 11 on the exposed sheet surface were selected to be Thr, one ofthe best β-sheet forming residues (Kim, et al. 1993).

FIG. 11 (SEQ ID NO:4 to SEQ ID NO:22) shows the alignment of thesequences for FSD-1 (SEQ ID NO:3) and Zif268 (SEQ ID NO:1). Only 6 ofthe 28 residues (21%) are identical and only 11 (39%) are similar. Fourof the identities are in the buried cluster, which is consistent withthe expectation that buried residues are more conserved than solventexposed residues for a given motif (Bowie, et al., Science 247:1306-1310(1990)). A BLAST (Altschul, et al., supra) search of the FSD-1 sequence(SEQ ID NO:3) against the non-redundant protein sequence database of theNational Center for Biotechnology Information did not find any zincfinger protein sequences. Further, the BLAST search found only lowidentity matches of weak statistical significance to fragments ofvarious unrelated proteins. The highest identity matches were 10residues (36%) with p values ranging from 0.63-1.0. Random 28 residuesequences that consist of amino acids allowed in the ββα positionclassification described above produced similar BLAST search results,with 10 or 11 residue identities (36-39%) and p values ranging from0.35-1.0, further suggesting that the matches found for FSD-1 arestatistically insignificant. The very low identity to any known proteinsequence demonstrates the novelty of the FSD-1 sequence (SEQ ID NO:3)and underscores that no sequence information from any protein motif wasused in our sequence scoring function.

In order to examine the robustness of the computed sequence, thesequence of FSD-1 (SEQ ID NO:3) was used as the starting point of aMonte Carlo simulated annealing run. The Monte Carlo search finds highscoring, suboptimal sequences in the neighborhood of the optimalsolution (Dahiyat, et al., (1996) (supra)). The energy spread from theground-state solution to the 1000^(th) most stable sequence is about 5kcal/mol indicating that the density of states is high. The amino acidscomprising the core of the molecule, with the exception of position 7,are essentially invariant (FIG. 11) (SEQ ID NO:4 to SEQ ID NO:22).Almost all of the sequence variation occurs at surface positions, andtypically involves conservative changes. Asn 14, which is predicted toform a helix N-cap, is among the most conserved surface positions. Thestrong sequence conservation observed for critical areas of the moleculesuggests that if a representative sequence folds into the design targetstructure, then perhaps thousands of sequences whose variations do notdisrupt the critical interactions may be equally competent. Even ifbillions of sequences would successfully achieve the target fold, theywould represent only a vanishingly small proportion of the 10²⁷ possiblesequences.

Experimental validation. FSD-1 (SEQ ID NO:3) was synthesized in order tocharacterize its structure and assess the performance of the designalgorithm. The far UV circular dichroism (CD) spectrum of FSD-1 (SEQ IDNO:3) shows minima at 220 nm and 207 nm, which is indicative of a foldedstructure (data not shown). The thermal melt is weakly cooperative, withan inflection point at 39° C., and is completely reversible (data notshown). The broad melt is consistent with a low enthalpy of foldingwhich is expected for a motif with a small hydrophobic core. Thisbehavior contrasts the uncooperative thermal unfolding transitionsobserved for other folded short peptides (Scholtz, et al., 1991). FSD-1(SEQ ID NO:3) is highly soluble (greater than 3 mM) and equilibriumsedimentation studies at 100 μM, 500 μM and 1 mM show the protein to bemonomeric. The sedimentation data fit well to a single species, monomermodel with a molecular mass of 3630 at 1 mM, in good agreement with thecalculated monomer mass of 3488. Also, far UV CD spectra showed noconcentration dependence from 50 μM to 2 mM, and nuclear magneticresonance (NMR) COSY spectra taken at 100 μM and 2 mM were essentiallyidentical.

The solution structure of FSD-1 (SEQ ID NO:3) was solved usinghomonuclear 2D ¹H NMR spectroscopy (Piantini, et al., 1982). NMR spectrawere well dispersed indicating an ordered protein structure and easingresonance assignments. Proton chemical shift assignments were determinedwith standard homonuclear methods (Wuthrich, 1986). Unambiguoussequential and short-range NOEs indicate helical secondary structurefrom residues 15 to 26 in agreement with the design target.

The structure of FSD-1 (SEQ ID NO:3) was determined using 284experimental restraints (10.1 restraints per residue) that werenon-redundant with covalent structure including 274 NOE distancerestraints and 10 hydrogen bond restraints involving slowly exchangingamide protons. Structure calculations were performed using X-PLOR(Brunger, 1992) with standard protocols for hybrid distancegeometry-simulated annealing (Nilges, et al., FEBS Lett. 229:317(1988)). An ensemble of 41 structures converged with good covalentgeometry and no distance restraint violations greater than 0.3 Å (Table9). TABLE 9 NMR structure determination: distance restraints, structuralstatistics and atomic root-mean-square (rms) deviations. <SA> are the 41simulated annealing structures, SA is the average structure beforeenergy minimization, (SA)_(r) is the restrained energy minimized averagestructure, and SD is the standard deviation. Distance restraintsIntraresidue 97 Sequential 83 Short range (|i-j| = 2-5 59 residues) Longrange (|i-j| > 5 35 residues) Hydrogen bond 10 Total 284 Structuralstatistics <SA> ± SD (SA)_(r) Rms deviation from 0.043 ± 0.003 0.038distance restraints (Å) Rms deviation from idealized geometry Bonds (Å)0.0041 ± 0.0002 0.0037 Angles 0.67 ± 0.02 0.65 (degrees) Impropers 0.53± 0.05 0.51 (degrees) Atomic rms deviations (Å)* <SA> vs. SA ± SD <SA>vs. (SA)_(r) ± SD Backbone 0.54 ± 0.15 0.69 ± 0.16 Backbone + nonpolar0.99 ± 0.17 1.16 ± 0.18 side chains^(†) Heavy atoms 1.43 ± 0.20 1.90 ±0.29*Atomic rms deviations are for residues 3 to 26, inclusive. Residues 1,2, 27 and 28 were disordered (φ, ψ angular order parameters (34) < 0.78)and had only sequential and |i-j| = 2 NOEs.^(†)Nonpolar side chains are from residues 3, 5, 7, 12, 18, 21, 22, and25 which consitute the core of the protein.

The backbone of FSD-1 (SEQ ID NO:3) is well defined with aroot-mean-square (rms) deviation from the mean of 0.54 Å (residues3-26). Considering the buried side chains (residues 3, 5, 7, 12, 18, 21,22, and 25) in addition to the backbone gives an rms deviation of 0.99Å, indicating that the core of the molecule is well ordered. Thestereochemical quality of the ensemble of structures was examined usingPROCHECK (Laskowski, et al., J. Appl. Crystallogr. 26:283 (1993)). Notincluding the disordered termini and the glycine residues, 87% of theresidues fall in the most favored region and the remainder in theallowed region of φ, ψ space. Modest heterogeneity is present in thefirst strand (residues 3-6) which has an average backbone angular orderparameter (Hyberts, et al., 1992) of <S>=0.96±0.04 compared to thesecond strand (residues 9-12) with an <S>=0.98±0.02 and the helix(residues 15-26) with an <S>=0.99±0.01. Overall, FSD-1 is notably wellordered and, to our knowledge, is the shortest sequence consistingentirely of naturally occurring amino acids that folds to a uniquestructure without metal binding, oligomerization or disulfide bondformation (McKnight, et al., 1997).

The packing pattern of the hydrophobic core of the NMR structureensemble of FSD-1 (SEQ ID NO:3) (Tyr 3, Ile 7, Phe 12, Leu 18, Phe 21,Ile 22, and Phe 25) is similar to the computed packing arrangement. Fiveof the seven residues have χ₁ angles in the same gauche⁺, gauche⁻ ortrans category as the design target, and three residues match both χ₁and χ₂ angles. The two residues that do not match their computed χ₁angles are Ile 7 and Phe 25, which is consistent with their location atthe less constrained, open end of the molecule. Ala 5 is not involved inits expected extensive packing interactions and instead exposes about45% of its surface area because of the displacement of the strand 1backbone relative to the design template. Conversely, Lys 8 behaves aspredicted by the algorithm with its solvent exposure (60%) and χ₁ and χ₂angles matching the computed structure. Most of the solvent exposedresidues are disordered which precludes examination of the predictedsurface residue hydrogen bonds. Asn 14, however, forms a helix N-capfrom its sidechain carbonyl oxygen as predicted, but to the amide of Glu17, not Lys 16 as expected from the design. This hydrogen bond ispresent in 95% of the structure ensemble and has a donor-acceptordistance of 2.6±0.06 Å. In general, the side chains of FSD-1 (SEQ IDNO:3) correspond well with the design program predictions.

A comparison of the average restrained minimized structure of FSD-1 (SEQID NO:3) and the design target was done (data not shown). The overallbackbone rms deviation of FSD-1 (SEQ ID NO:3) from the design target is1.98 Å for residues 3-26 and only 0.98 Å for residues 8-26 (Table 10).TABLE 10 Comparison of the FSD-1 (SEQ ID NO: 3) experimentallydetermined structure and the design target structure. The FSD-1 (SEQ IDNO: 3) structure is the restrained energy minimized average from the NMRstructure determination. The design target structure is the second DNAbinding module of the zinc finger Zif268 (9) (SEQ ID NO: 1). Atomic rmsdeviations (Å) Backbone, residues 1.98 3-26 Backbone, residues 0.98 8-26Super-secondary structure parameters* FSD-1 Design Target h (Å) 9.9 8.9θ (degrees) 14.2 16.5 Ω (degrees) 13.1 13.5*h, θ, Ω are calculated as previously described (36, 37). h is thedistance between the centroid of the helix Cα coordinates (residues15-26) and the least-square plane fit to the Cα coordinates of the sheet(residues 3-12. θ is the angle of inclination of the principal moment ofthe helix Cα atoms with the plane of the sheet. Ω is the angle betweenthe projection of the principal moment of the helix onto the# sheet and the projection of the average least-square fit line to thestrand Cα coordinates (residues 3-6 and 9-12) onto the sheet.

The largest difference between FSD-1 (SEQ ID NO:3) and the targetstructure occurs from residues 4-7, with a displacement of 3.0-3.5 Å ofthe backbone atom positions of strand 1. The agreement for strand 2, thestrand to helix turn, and the helix is remarkable, with the differencesnearly within the accuracy of the structure determination. For thisregion of the structure, the rms difference of φ, ψ angles between FSD-1(SEQ ID NO:3) and the design target is only 14±9°. In order toquantitatively assess the similarity of FSD-1 (SEQ ID NO:3) to theglobal fold of the target, we calculated their supersecondary structureparameters (Table 9) (Janin & Chothia, J. Mol. Biol. 143:95 (1980); Su &Mayo, Protein Sci. in press, 1997), which describe the relativeorientations of secondary structure units in proteins. The values of θ,the inclination of the helix relative to the sheet, and Ω, the dihedralangle between the helix axis and the strand axes, are nearly identical.The height of the helix above the sheet, h, is only 1 Å greater in FSD-1(SEQ ID NO:3). A study of protein core design as a function of helixheight for Gb1 variants demonstrated that up to 1.5 Å variation in helixheight has little effect on sequence selection (Su & Mayo, supra, 1997).The comparison of secondary structure parameter values and backbonecoordinates highlights the excellent agreement between theexperimentally determined structure of FSD-1 (SEQ ID NO:3) and thedesign target, and demonstrates the success of our algorithm atcomputing a sequence for this ββα motif.

The quality of the match between FSD-1 (SEQ ID NO:3) and the designtarget demonstrates the ability of our program to design a sequence fora fold that contains the three major secondary structure elements ofproteins: sheet, helix, and turn. Since the ββα fold is different fromthose used to develop the sequence selection methodology, the design ofFSD-1 (SEQ ID NO:3) represents a successful transfer of our program to anew motif.

Example 6

Calculation of Solvent Accessible Surface Area Scaling Factors

In contrast to the previous work, backbone atoms are included in thecalculation of surface areas. Thus, the calculation of the scalingfactors proceeds as follows.

The program BIOGRAF (Molecular Simulations Incorporated, San Diego,Calif.) was used to generate explicit hydrogens on the structures whichwere then conjugate gradient minimized for 50 steps using the DREIDINGforce field. Surface areas were calculated using the Connolly algorithmwith a dot density of 10 Å−2, using a probe radius of zero and an add-onradius of 1.4 Å and atomic radii from the DREIDING force-field. Atomsthat contribute to the hydrophobic surface area are carbon, sulfur andhydrogen atoms attached to carbon and sulfur.

For each side-chain rotamer r at residue position i with a localtri-peptide backbone t3, we calculated A_(i_(r)t3)⁰the exposed area of the rotamer and its backbone in the presence of thelocal tri-peptide backbone, and A_(i) _(r) _(t) the exposed area of therotamer and its backbone in the presence of the entire template t whichincludes the protein backbone and any side-chains not involved in thecalculation (FIG. 13). The difference between A_(i_(r)t3)⁰,and A_(i) _(r) _(t) is the total area buried by the template for arotamer r at residue position i. For each pair of residue positions iand j and rotamers r and s on i and j, respectively, A_(i) _(r) _(j)_(s) _(t) the exposed area of the rotamer pair in the presence of theentire template, is calculated. The difference between A_(i) _(r) _(j)_(s) _(t) and the sum of A_(i) _(r) _(t) and A_(j) _(s) _(t) is the areaburied between residues i and j, excluding that area by the template.The pairwise approximation to the total buried surface area is:Equation  29:$A_{buried}^{pairwise} = {{\sum\limits_{i}\left( {A_{i_{r}{t3}}^{0} - A_{i_{r}t}} \right)} + {f{\sum\limits_{i < j}\left( {A_{i_{r}t} + A_{j_{s}t} - A_{i_{r}j_{s}t}} \right)}}}$

As shown in FIG. 13, the second sum in Equation 29 over-counts theburied area. We have therefore multiplied the second sum by a scalefactor f whose value is to be determined empirically. Expected values off are discussed below.

Noting that the buried and exposed areas should add to the total area,Σ_(i)A_(i_(r)t3)⁰,the solvent-exposed surface area is: Equation  30:$A_{exposed}^{pairwise} = {{\sum\limits_{i}A_{i_{r}t}} - {f{\sum\limits_{i < j}\left( {A_{i_{r}t} + A_{j_{s}t} - A_{i_{r}j_{s}t}} \right)}}}$

The first sum of Equation 30 represents the total exposed area of eachrotamer in the context of the protein template ignoring interactionswith other rotamers. The second sum of Equation 30 subtracts the buriedareas between rotamers and is scaled by the same parameter f as inEquation 29.

Some insight into the expected value of f can be gained fromconsideration of a close-packed face centered cubic lattice of spheresor radius r. When the radii are increased from r to R, the surface areaon one sphere buried by a neighboring sphere is 2πR(R−r). We take r tobe a carbon radius (1.95 Å), and R is 1.4 Å larger. Then, using:$f = \frac{\text{true~~buried~~area}}{\text{pairwise~~buried~~area}}$and noting that each sphere has 12 neighbors, results in:$f = \frac{4\quad\pi\quad R_{2}}{12x\quad 2\quad\pi\quad{R\left( {R - r} \right)}}$

This yields f=0.40. A close-packed face centered cubic lattice has apacking fraction of 74%. Protein interiors have a similar packingfraction, although because many atoms are covalently bonded the closepacking is exaggerated. Therefore this value of f should be a lowerbound for real protein cores. For non-core residues, where the packingfraction is lower, a somewhat larger value of f is expected.

We classified residues from ten proteins ranging in size from 54 to 289residues into core or non-core as follows. We classified resides as coreor non-core using an algorithm that considered the direction of eachside-chain's Cα-Cβ vector relative to the surface computed using onlythe template Cα atoms with a carbon radius of 1.95 Å, a probe radius of8 Å and no add-on radius. A residue was classified as a core position ifboth the distance from its Cα atom (along its Cα-Cβ vector) to thesurface was greater than 5.0 Å and the distance from its Cβ atom to thenearest point on the surface was greater than 2.0 Å. The advantage ofsuch an algorithm is that a knowledge of the amino acid type actuallypresent at each residue position is not necessary. The proteins were asshown in Table I, showing selected proteins, total number of residuesand the number of residues in the core and non-core of each protein (Glyand pro were not considered). Brookhaven Identifier Total Size Core SizeNon-Core Size 1enh 54 10 40 1pga 56 10 40 1ubi 76 16 50 1mol 94 19 611kpt 105 27 60 4azu-A 128 39 71 1gpr 158 39 89 1gcs 174 53 98 1edt 26695 133 1pbn 289 96 143

The classification into core and non-core was made because core residuesinteract more strongly with one another than do non-core residues. Thisleads to greater over-counting of the buried surface area for coreresidues.

Considering the core and non-core cases separately, the value of f whichmost closely reproduced the true Lee and Richards surface areas wascalculated for the ten proteins. The pairwise approximation very closelymatches the true buried surface area (data not shown). It also performsvery well for the exposed hydrophobic surface area of non-core residues(data not shown). The calculation of the exposed surface area of theentire core of a protein involves the difference of two large and nearlyequal areas and is less accurate; as will be shown, however, when thereis a mixture of core and non-core residues, a high accuracy can still beachieved. These calculations indicate that for core residues f is 0.42and for non-core residues f is 0.79.

To test whether the classification of residues into core and non-corewas sufficient, we examined subsets of interacting residues in the coreand non-core positions, and compared the true buried area of each subsetwith that calculated (using the above values of f). For both subsets ofthe core and the non-core, the correlation remained high (R²=1.00)indicating that no further classification is necessary (data not shown).(Subsets were generated as follows: given a seed residue, a subset ofsize two was generated by adding the closest residue: the next closestresidue was added for a subset of size three, and this was repeated upto the size of the protein. Additional subsets were generated byselecting different seed residues.)

It remains to apply this approach to calculating the buried or exposedsurface areas of an arbitrary selection of interacting core and non-coreresidues in a protein. When a core residue and a non-core residueinteract, we replace Equation 29 with: Equation  31:$A_{buried}^{pairwise} = {{\sum\limits_{i}\left( {A_{i_{r}{t3}}^{0} - A_{i_{r}t}} \right)} + {\sum\limits_{i < j}\left( {{f_{i}A_{i_{r}t}} + {f_{j}A_{j_{f}t}} - {f_{ij}A_{i_{r}j_{f}t}}} \right)}}$and Equation 30 with Equation 32:$A_{exposed}^{pairwise} = {{\sum\limits_{i}A_{i_{r}t}} - {\sum\limits_{i < j}\left( {{f_{i}A_{i_{r}t}} + {f_{j}A_{j_{f}t}} - {f_{ij}A_{i_{r}j_{f}t}}} \right)}}$where f_(i) and f_(j) are the values of f appropriate for residues i andj, respectively, and f_({ij}) takes on an intermediate value. Usingsubsets from the whole of 1 pga, the optimal value of f_(ij) was foundto be 0.74. This value was then shown to be appropriate for other testproteins (data not shown).

1. A method executed by a computer under the control of a program, saidcomputer including a memory for storing said program, said methodcomprising the steps of: a) receiving a set of three dimensionalcoordinates for a target protein with a plurality of variable residuepositions; b) computationally generating a first set of rank orderedoptimized variant sequences using at least one scoring function, whereinsaid variant sequences comprise at least one variant amino acid at atleast one variant residue position; and c) computationally generating asecond set of suboptimal variant sequences from said set.
 2. A methodaccording to claim 1 wherein said first set comprises the global minimumusing said scoring function.
 3. A method according to claim 1 whereinsaid first set consists of the global minimum using said scoringfunction.
 4. A method according to claim 1 wherein said second set isgenerated using a Monte Carlo search.
 5. A method according to claim 1wherein said variable residue positions are classified as core, surfaceor boundary residues.
 6. A method according to claim 1 wherein saidscoring function is selected from the group consisting of a van derWaals potential scoring function, a hydrogen bond potential scoringfunction, an atomic solvation scoring function, an electrostatic scoringfunction and a secondary structure propensity scoring function.
 7. Amethod according to claim 1 wherein said computational generation ofsaid first set utilizes at least two scoring functions.
 8. A methodaccording to claim 1 wherein said computational generation of said firstset utilizes at least three scoring functions.
 9. A method according toclaim 1 wherein said computational generation of said first set utilizesat least four scoring functions.